class: center, middle, inverse, title-slide .title[ # Lecture 4: Data Cleaning ] .author[ ### James Sears*
AFRE 891/991 FS 25
Michigan State University ] .date[ ### .small[
*Parts of these slides are adapted from
“Advanced Data Analytics”
by Nick Hagerty and
“Data Science for Economists”
by Grant McDermott.] ] --- <style type="text/css"> # CSS for including pauses in printed PDF output (see bottom of lecture) @media print { .has-continuation { display: block !important; } } .remark-code-line { font-size: 95%; } .small { font-size: 75%; } .scroll-output-full { height: 90%; overflow-y: scroll; } .scroll-output-75 { height: 75%; overflow-y: scroll; } </style> # Table of Contents 1. [Prologue](#prologue) 1. [Paths and Importing Data](#import) 1. [Keys and Relational Data](#keys) 1. [String Cleaning](#string) 1. [Number Storage](#numbers) 1. [Data Cleaning Checklist](#check) --- class: inverse, middle name: prologue # Prologue --- # Data Cleaning No that we know how to wrangle data in R, it's time to talk more specifically about .hi-medgrn[Data Cleaning] - both the concerns to keep in mind while processing data and the "nitty gritty" of how to implement the necessary steps. * .hi-blue[Importing data] * .hi-medgrn[Keys and relational data] * .hi-pink[Cleaning character strings] * .hi-purple[Number Storage] -- Packages we will use today: * .hi-slate[stringr] * .hi-slate[tidyverse] * .hi-slate[nycflights13] ``` r pacman::p_load(haven, sjlabelled, stringr, tidyverse, nycflights13) ``` --- class: inverse, middle name: import # Paths and Importing Data --- # Paths and Directories The .hi-medgrn[working directory] is your .hi-medgrn["current location"] in the filesystem. What's your current working directory? ``` r getwd() ``` ``` ## [1] "C:/Users/searsja1/OneDrive - Michigan State University/Github/AFRE-891-991-FS25/Lecture Slides/04-Cleaning" ``` * This is an example of a full or .hi-slate[absolute path]. * Usually starts with `C:/` on Windows or `/` on Mac. * Defaults to the .hi-medgrn[folder containing your script/Rmd file] --- # Paths and Directories In contrast, .hi-blue[relative paths] are defined .hi-blue[relative to] the full path of the working directory. * Let's say your working directory is `"C:/Github/AFRE-891-991-FS25/"` -- * You have a file saved at `"C:/Github/AFRE-891-991-FS25/assignment-1/assignment-1.Rmd"` * This is it's .hi-medgrn[absolute path] -- * Its .hi-blue[relative path] would be .hi-blue[`"assignment-1/assignment-1.Rmd"`] -- * Relies on the folder/directory nesting within your filesystem In R you can use either an absolute or relative path in any given situation, but .hi-blue[relative paths] are usually .hi-blue[easier to work with]. --- # Paths and Directories: Best Practices .hi-blue[Option 1: use relative paths within GitHub repositories] * In the main folder (root directory), place * A **README** file that gives a basic overview of your project. * A **master script/R Markdown file** that lists and runs all other scripts * Use paths .hi-blue[relative] to included folder structure specific to each type of input/output, a la... --- # Paths and Directories: Best Practices ``` r /my_project /rawData /processedData /code /1_clean /2_process /3_results /output /tables /sumStats /regression /figures /estimates ``` The relative path to access a processed data file named `parcels.dta` is then .center[ `"processedData/parcels.dta"` ] --- # Paths and Directories: Best Practices .hi-purple[Option 2: use relative paths within other Version Control] * Even if you're not using GitHub, you can use a .hi-purple[similar folder structure] for projects .hi-purple[saved locally] * Back up files/sync across computers with cloud storage (a la OneDrive) --- # Download this data Most data does not come nicely in R packages! You will download it from somewhere and load it in R. [Go here: NYTimes COVID-19 Data](https://github.com/nytimes/covid-19-data/tree/master/colleges) and click on the `colleges.csv` file, then the .hi-medgrn[Download raw file] link. * This is a list of COVID-19 case counts reported at U.S. colleges and universities between July 2020 and May 2021. .hi-blue[Save] this file .hi-blue[somewhere sensible] on your computer * Perhaps the "data" subfolder in your cloned Lecture 4 Slides folder? --- # Setting the Working Directory You can change your working directory with `setwd(dir)`. Now: 1. .hi-medgrn[Find the location you saved your CSV file and copy the filepath.] * Manual navigation or use the .hi-slate[Files] window in bottom-right .pull-left[ <img src = "images/nav_dir1.png" width = "350" /img> ] .pull-right[ <img src = "images/nav_dir2.png" width = "350" /img> ] --- # Setting the Working Directory You can change your working directory with `setwd(dir)`. Now: 1. .hi-medgrn[Find the location you saved your CSV file and copy the filepath.] 2. .hi-blue[Use the console in R to set your working directory to that location.] For example: ``` r setwd("F:/OneDrive - Michigan State University/Github/covid_ex/") ``` --- # setwd Best Practices .hi-purple[Pro tip: minimize use of `setwd()` in scripts.] * Someone else's working directory will be different from yours! * You want your code to be portable. * .hi-blue[Best:] build a repository/project folder and use a master script/Rmd that .hi-blue[automatically uses relative file paths]! * .hi-medgrn[Better:] set working directory to the project folder .hi-medgrn[outside the script] (like we just did) * .hi-pink[Good:] declare a .hi-pink[`maindir`] path to your project folder at the start of your script, set working directory to that path * .hi-slate[Worst:] changing working directories more than once in a script (barf) --- # Read in Data (with readr) The main reason we bother with working directories is to let us .hi-medgrn[read in and interact with data.] -- The tidyverse package `readr` provides lots of options to read data into R. Read in the college COVID data using `read_csv` and the .hi-purple[relative filepath]: ``` r col <- read_csv("data/colleges.csv") ``` `View` this data to take a look at it. ``` r str(col) ``` ``` ## spc_tbl_ [1,948 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ## $ date : Date[1:1948], format: "2021-05-26" "2021-05-26" ... ## $ state : chr [1:1948] "Alabama" "Alabama" "Alabama" "Alabama" ... ## $ county : chr [1:1948] "Madison" "Montgomery" "Limestone" "Lee" ... ## $ city : chr [1:1948] "Huntsville" "Montgomery" "Athens" "Auburn" ... ## $ ipeds_id : chr [1:1948] "100654" "100724" "100812" "100858" ... ## $ college : chr [1:1948] "Alabama A&M University" "Alabama State University" "Athens State University" "Auburn University" ... ## $ cases : num [1:1948] 41 2 45 2742 220 ... ## $ cases_2021: num [1:1948] NA NA 10 567 80 NA 49 53 10 35 ... ## $ notes : chr [1:1948] NA NA NA NA ... ## - attr(*, "spec")= ## .. cols( ## .. date = col_date(format = ""), ## .. state = col_character(), ## .. county = col_character(), ## .. city = col_character(), ## .. ipeds_id = col_character(), ## .. college = col_character(), ## .. cases = col_double(), ## .. cases_2021 = col_double(), ## .. notes = col_character() ## .. ) ## - attr(*, "problems")=<externalptr> ``` --- # Reading in Data with readr .hi-slate[readr] can read a wide set of .hi-medgrn[plain text and delimited] files, as well as .hi-purple[R Data files] | File Type | Function | Use| |----|-----|-----| | CSV | `read_csv` | comma delimited<sup>1</sup> | | CSV | `read_csv2` | semicolon delimited, comma decimal mark | | TSV | `read_tsv` | tab delimited | | Delimited Plain Text | `read_delim` | Any delimiter | | Fixed Width Text | `read_fwf` | Fixed width | | R Data | `read_rds` | storage-efficient native R files | .footnote[More on readr options [.hi-dkorange[here.]](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_data-import.pdf)] --- # File Types and Storage Size Note that `.rds` files are considerably .hi-medgrn[more storage-efficient than STATA files]: <img src="data:image/png;base64,#images/file_sizes.png" width="80%" style="display: block; margin: auto;" /> --- # Reading in Data with readr .hi-slate[readr] will guess at column types, but often it makes sense to .hi-medgrn[manually specify what column types are] * i.e. FIPS codes with leading zeroes as character strings -- Include the `col_types = list()` argument to tell readr what the column types are before reading in: .font90[ ``` r col <- read_csv("data/colleges.csv", col_types = list( date = col_date(), state = col_character(), county = col_character(), city = col_character(), ipeds_id = col_character(), college = col_character(), cases = col_integer(), cases_2021 = col_integer(), notes = col_character() )) ``` ] --- # Reading in Data with readr We can also use .hi-medgrn[string abbreviations] in the list to simplify: ``` r col <- read_csv("data/colleges.csv", col_types = list( date = "d", # can mix abbreviations/functions state = col_character(), county = "c", city = "c", ipeds_id = "c", college = "c", cases = "i", cases_2021 = "i", notes = "c" )) ``` --- # Reading in Data with readr Or if we don't want to specify variable names, we can use a .hi-blue[single string of abbreviations]: ``` r col <- read_csv("data/colleges.csv", * col_types = "dccccciic" ) ``` --- # Reading in Online Data Note that we can read files in .hi-medgrn[directly from the internet] without downloading them first * You .hi-blue[usually shouldn't] - this is terrible for .hi-blue[retention]! * What happens if the file location changes? * Or the host updates the year in the file name? ``` r url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/colleges/colleges.csv" col <- read_csv(url) ``` --- # Reading in Data: STATA Files The .hi-slate[haven] package's `read_dta()` works great for reading in STATA-formatted `.dta` files. * For example: hospital capacities by county for 2020-2021: ``` r hosp_cap <- haven::read_dta("data/hospital_capacity.dta") ``` Note that, as in Stata, the variables .hi-medgrn[contain labels]. These are visible when viewing and are stored as a .hi-blue[variable attribute]: ``` r attr(hosp_cap$fips_code, "label") ``` ``` ## [1] "County FIPS" ``` --- # Labels We *could* modify labels directly with `attr(df$var, "label") <- "Label"` ``` r attr(hosp_cap$fips_code, "label") ``` ``` ## [1] "County FIPS" ``` ``` r attr(hosp_cap$fips_code, "label") <- "FIPS Code" attr(hosp_cap$fips_code, "label") ``` ``` ## [1] "FIPS Code" ``` But that's a pain and doesn't fit into our tidy workflow. -- Instead, let's us the .hi-slate[sjlabelled] package and its methods for labels --- # Labeled Data .hi-slate[sjlabelled] includes methods for both .hi-medgrn[variable labels] *and* .hi-blue[data labels] -- .pull-left[ .center.hi-medgrn[Variable Labels] * `get_label()`: retrieve variable labels * `set_label()`: set variable labels * `var_labels()`: pipeable way of setting labels ] -- .pull-right[ .center.hi-blue[Data Labels] * `get_labels()`: retrieve data labels * `set_labels()`: set data labels ] --- # Getting Variable Labels Use `get_label()` to... get the variable labels. * Returns a named character vector -- One variable: ``` r get_label(hosp_cap$fips_code) ``` ``` ## [1] "FIPS Code" ``` Or all at once: ``` r get_label(hosp_cap) ``` ``` ## fips_code collection_week n_hosp ## "FIPS Code" "Collection Week" "# Hospitals" ## beds_total beds_inpatient beds_icu ## "Total Hospital Beds" "Inpatient Hospital Beds" "ICU Beds" ## beds_adult beds_adult_inpatient beds_ped_inpatient ## "Adult Hospital Beds" "Adult Inpatient Beds" "" ``` --- # Setting Variable Labels Use `set_label()` to assign labels. For example, `beds_patient_inpatient` is missing its label. Let's add it: ``` r set_label(hosp_cap$beds_ped_inpatient) <- "Pediatric Inpatient Beds" get_label(hosp_cap$beds_ped_inpatient) ``` ``` ## [1] "Pediatric Inpatient Beds" ``` -- We could also use `set_label()` to set .hi-medgrn[all variable labels at once with the syntax] .center[ `hosp_cap <- set_label(hosp_cap, c("Var 1", "Var 2",..., "Var K"))` ] --- # Setting Variable Labels (Pipeable) Alternatively, use the tidy-conforming function `var_labels()`. * Syntax works like `rename(df, var1 = "Label 1", varj = "Label J")` ``` r hosp_cap_lab <- hosp_cap %>% var_labels(fips_code = "County FIPS Code", n_hosp = "Hospital Count") get_label(hosp_cap_lab) ``` ``` ## fips_code collection_week ## "County FIPS Code" "Collection Week" ## n_hosp beds_total ## "Hospital Count" "Total Hospital Beds" ## beds_inpatient beds_icu ## "Inpatient Hospital Beds" "ICU Beds" ## beds_adult beds_adult_inpatient ## "Adult Hospital Beds" "Adult Inpatient Beds" ## beds_ped_inpatient ## "Pediatric Inpatient Beds" ``` --- # Getting Value Labels We can use `get_labels()` in much the same way as `get_label()` to retrieve the .hi-blue[value labels]. Currently, we have no value labels on our numeric variables (say, `n_hosp`): ``` r get_labels(hosp_cap$n_hosp) ``` ``` ## NULL ``` -- While the level of each character is counted as its label by default. ``` r get_labels(hosp_cap$fips_code) %>% head(50) ``` ``` ## [1] "01001" "01003" "01005" "01007" "01009" "01011" "01013" "01015" "01017" ## [10] "01019" "01021" "01023" "01025" "01027" "01031" "01033" "01035" "01039" ## [19] "01041" "01043" "01045" "01047" "01049" "01051" "01053" "01055" "01057" ## [28] "01059" "01061" "01063" "01065" "01069" "01071" "01073" "01077" "01079" ## [37] "01081" "01083" "01089" "01091" "01093" "01095" "01097" "01099" "01101" ## [46] "01103" "01109" "01111" "01113" "01115" ``` --- # Setting Value Labels We can use `set_labels()` to set the variable labels for a specific value(s) of a variable: .center[ `set_labels(df, var1, var2,..., labels = c("label1" = value1, "label2" = value2,...))` ] ``` r hosp_cap_lab <- set_labels(hosp_cap_lab, fips_code, labels = c("Autauga County, Alabama" = "01001") ) get_labels(hosp_cap_lab$fips_code) ``` ``` ## [1] "Autauga County, Alabama" ``` --- # Setting Value Labels To set all levels of a variable, we can just pass a vector of labels .center[ `set_labels(df, var1, var2,..., labels = c(label1, label2,...,labelJ))` ] Note that all this also works for vectors too! ``` r vec <- c(10, 5, 4, 6, 8, 5, 4, 10, 10, 6) get_labels(vec) ``` ``` ## NULL ``` Adding in text-based labels for each level ``` r label_vec <- c(10, 5, 4, 6, 8) names(label_vec) <- c("Ten", "Five", "Four", "Six", "Eight") vec <- set_labels(vec, labels = label_vec) vec ``` ``` ## [1] 10 5 4 6 8 5 4 10 10 6 ## attr(,"labels") ## Four Five Six Eight Ten ## 4 5 6 8 10 ``` --- # Reading in Data: Other Formats Often we need to read in other data types, for which we'll need .hi-slate[other packages] | File Type | Function(s) | Package| |----|-----|-----| | CSV | `fread` | .hi-slate[data.table] (good for large files) | | Excel (.xlsx, .xls) | `read_excel` | .hi-slate[readxl] | | Google Sheets | `read_sheet` | .hi-slate[googlesheets4] | | Stata, SAS, SPSS | `read_dta/read_sas/read_sav` | .hi-slate[haven] | | R Data (.rds) | `readRDS` | .hi-slate[base R]| --- # Challenge .hi-slate[Which state had the least total reported Covid-19 cases at colleges and universities?] * Was it Michigan? --- # Writing Out Data with readr .hi-slate[readr] also makes it easy to write (save) out processed files with the `write_X(object, path)` functions. | File Type | Function | Use| |----|-----|-----| | CSV | `write_csv` | comma delimited<sup>1</sup> | | CSV | `write_csv2` | semicolon delimited, comma decimal mark | | TSV | `write_tsv` | tab delimited | | Delimited Plain Text | `write_delim` | Any delimiter | | Fixed Width Text | `write_fwf` | Fixed width | | R Data | `write_rds` | storage-efficient native R files | --- # Writing Out Data with readr Let's say we want to produce a summary statistics table of .hi-blue[all cases] across .hi-blue[all colleges in a state], sorted .hi-blue[most to least] for the .hi-blue[10 states with highest caseloads] Using our data wrangling skills from last lecture, how could we write this? --- # Writing Out Data with readr ``` r state_tab <- group_by(col, state) %>% summarise(total_cases = sum(cases, na.rm = T)) %>% arrange(desc(total_cases)) %>% ungroup() %>% filter(row_number() <= 10) ``` -- We can then save it out as a CSV with `write_csv(object, path)` * Note that we only need to supply the .hi-blue[relative path] ``` r write_csv(state_tab, "data/state_top10.csv") ``` --- class: inverse, middle name: keys # Keys and Relational Data Images in this section are from [.hi-dkorange[R for Data Science]](https://r4ds.had.co.nz/relational-data.html) by Wickham & Grolemund, used under [.hi-dkorange[CC BY-NC-ND 3.0]](https://creativecommons.org/licenses/by-nc-nd/3.0/us/) and not included under this resource's overall CC license. --- # Relational data More often than not, we'll be working with .hi-medgrn[relational data:] multiple tables of data that have relations to each other .pull-left[ .center[<img src = "data:image/png;base64,#images/relational-nycflights.png" width = "500" />] ] .font90.pull-right[ * `flights` connects to `planes` via a single variable, `tailnum`. * `flights` connects to `airlines` through the `carrier` variable. * `flights` connects to `airports` in two ways: via the `origin` and `dest` variables. * `flights` connects to `weather` via `origin` (the location), and `year`, `month`, `day` and `hour` (the time). ] --- # Keys To join relation data, we need .hi-blue[key variable(s)] that .hi-blue[uniquely identifies an observation]. * In `planes`, the key is `tailnum`. -- * In `weather`, the key consists of 5 variables: (`year`, `month`, `day`, `hour`, `origin`). --- # Keys There are two types of keys: 1. A .hi-medgrn[primary key] uniquely identifies an observation in .hi-medgrn[its own data frame]. * `planes$tailnum` is a .hi-medgrn[primary key] because it uniquely identifies each plane in the `planes` data frame. -- 1. A .hi-purple[foreign key] uniquely identifies an observation in .hi-purple[another data frame]. * `flights$tailnum` is a .hi-purple[foreign key] because it appears in the `flights` data frame where it matches each flight to a unique plane. -- A variable can be .hi-slate[both a primary key *and* a foreign key]. * For example, `origin` is part of the `weather` primary key, and is also a foreign key for the `airports` data frame. --- # Keys The .hi-medgrn[primary key] is the .hi-slate[first thing] you need to know about a new data frame. Once you think you know the primary key, .hi-medgrn[verify it]. Here's one way to do that: ``` r planes_dup <- rbind(planes, planes[1:100,]) %>% arrange(tailnum) %>% mutate(n = row_number()) planes_dist <- distinct(planes_dup, .keep_all = T) planes_dist2 <- group_by(planes_dup, tailnum) %>% mutate(count = row_number()) %>% filter(count == 1) %>% ungroup() View(flights) flights %>% count(month, day, dep_time, carrier, flight, tailnum) %>% filter(n > 1) ``` ``` ## # A tibble: 0 × 7 ## # ℹ 7 variables: month <int>, day <int>, dep_time <int>, carrier <chr>, ## # flight <int>, tailnum <chr>, n <int> ``` .font80[If `tailnum` is the primary key, we can't have any duplicate values!] --- # Keys You can write a .hi-medgrn[unit test] into your code to make sure uniqueness is true before proceeding: ``` r dups_planes <- planes %>% count(tailnum) %>% filter(n > 1) *stopifnot(nrow(dups_planes) == 0) dups_weather <- weather |> # same thing using base R pipe count(year, month, day, hour, origin) |> filter(n > 1) *stopifnot(nrow(dups_weather) == 0) ``` ``` ## Error: nrow(dups_weather) == 0 is not TRUE ``` --- # Surrogate Keys What's the primary key in the `flights` data frame? Take a minute to investigate/verify. --- # Surrogate Keys What's the primary key in the `flights` data frame? Take a minute to investigate/verify. You might think it would be the date + the carrier + the flight or tail number, but neither of those are unique: ``` r flights %>% count(year, month, day, carrier, flight) %>% filter(n > 1) ``` ``` ## # A tibble: 24 × 6 ## year month day carrier flight n ## <int> <int> <int> <chr> <int> <int> ## 1 2013 6 8 WN 2269 2 ## 2 2013 6 15 WN 2269 2 ## 3 2013 6 22 WN 2269 2 ## 4 2013 6 29 WN 2269 2 ## 5 2013 7 6 WN 2269 2 ## 6 2013 7 13 WN 2269 2 ## 7 2013 7 20 WN 2269 2 ## 8 2013 7 27 WN 2269 2 ## 9 2013 8 3 WN 2269 2 ## 10 2013 8 10 WN 2269 2 ## # ℹ 14 more rows ``` --- # Surrogate keys If a data frame lacks a primary key but it is tidy (each row is an observation), it's often useful to add in a .hi-blue[surrogate key]: ``` r flights2 = flights %>% arrange(year, month, day, carrier, flight, sched_dep_time) %>% mutate(id = row_number()) %>% relocate(id) %>% head(8) ``` --- # Relations A .hi-medgrn[primary key] and the corresponding .hi-purple[foreign key] in another data frame form a **relation**. In general, relations are .hi-slate[one-to-many]: Each flight has one plane, but each plane has many flights. * Sometimes you'll see a .hi-slate[one-to-one] relation, but you can think of this as a special case of one-to-many. * You can also find .hi-slate[many-to-many] relations, but you can think of these as two one-to-many relations going in each direction. * There's a many-to-many relationship between airlines and airports: each airline flies to many airports; each airport hosts many airlines. --- # Relations **Note on Stata: NEVER USE `merge m:m`. JUST DON'T DO IT.** There is no scenario in which it will give you what you want. This syntax should not exist. If you are tempted, you are probably either confused or looking for `joinby`. --- # Relations `join` does **not** think about whether your key is unique, or what type of relation you have. * Instead, it simply returns all possible combinations of observations in your two dataframes: .pull-left[ <img src = "data:image/png;base64,#images/join-one-to-many.png" width = "400" /> ] .pull-right[ <img src = "data:image/png;base64,#images/join-many-to-many.png" width = "400" /> ] --- # Duplicate Keys What if you join by a key that is not actually unique, when you think it is? You'll get .hi-medgrn[extra rows with incorrect matches]: ``` r flights_weather <- flights %>% left_join(weather, by=c("year", "month", "day", "origin")) nrow(flights_weather) ``` ``` ## [1] 8036575 ``` Now you no longer have a dataframe of unique flights. ``` r nrow(flights) ``` ``` ## [1] 336776 ``` --- # Best Practice: Joins Here's an example of a good (safe) way to join `flights` and `planes`: .hi-blue[1\. Confirm the primary key in `planes` is unique] ``` r # Confirm that tailnum is the primary key (unique ID) of planes dups_planes <- planes %>% count(tailnum) %>% filter(n > 1) stopifnot(nrow(dups_planes) == 0) ``` --- # Best Practice: Joins Here's an example of a good (safe) way to join `flights` and `planes`: .hi-blue[2\. Join, keeping the original join keys from both datasets] ``` r # Join, keeping the join keys from both datasets flights_planes <- flights %>% left_join(planes %>% rename(year_built = year), by="tailnum", keep=TRUE) %>% rename(tailnum = tailnum.x, tailnum_planes = tailnum.y) ``` --- # Best Practice: Joins Here's an example of a good (safe) way to join `flights` and `planes`: .hi-blue[3\. Confirm the join was one-to-many] ``` r # Confirm the join was 1:many stopifnot(nrow(flights) == nrow(flights_planes)) ``` --- class: inverse, middle name: string # String Cleaning Parts of this section are adapted from [.hi-dkorange[“Introduction to Data Science”]](http://rafalab.dfci.harvard.edu/dsbook/string-processing.html) by Rafael A. Irizarry, used under [.hi-dkorange[CC BY-NC-SA 4.0]](https://creativecommons.org/licenses/by-nc-sa/4.0). --- # String Cleaning Regardless of where we get them from, .hi-medgrn[character strings] often require a lot of cleaning work to get them into our desired formats * .hi-slate[Surveys:] report agricultural yields in a mix of bushels, pounds, hundred weight, tons, etc. * .hi-slate[Admin records:] manual entry of information prone to typos/inconsistencies Whether we want to convert to numeric values/dates or find matching info, we will likely need to do some pre-processing on our strings. -- Let's practice some .hi-purple[key string cleaning steps]. --- # String Cleaning Example Let's load in the raw data output from a web form asking students to report their height in inches: ``` r library(dslabs) data(reported_heights) str(reported_heights) ``` ``` ## 'data.frame': 1095 obs. of 3 variables: ## $ time_stamp: chr "2014-09-02 13:40:36" "2014-09-02 13:46:59" "2014-09-02 13:59:20" "2014-09-02 14:51:53" ... ## $ sex : chr "Male" "Male" "Male" "Male" ... ## $ height : chr "75" "70" "68" "74" ... ``` -- Unfortunately `height` is not numeric. Can we coerce it to numeric? ``` r heights2 <- reported_heights %>% mutate(height_num = as.numeric(height)) sum(is.na(heights2$height_num)) ``` ``` ## [1] 81 ``` --- # String Cleaning Example Yes, but we .hi-medgrn[lose a lot of information] because there are plenty of .hi-blue[non-numeric entries:] ``` r heights_probs <- filter(heights2, is.na(height_num)) View(heights_probs) heights_probs$height ``` ``` ## [1] "5' 4\"" "165cm" "5'7" ## [4] ">9000" "5'7\"" "5'3\"" ## [7] "5 feet and 8.11 inches" "5'11" "5'9''" ## [10] "5'10''" "5,3" "6'" ## [13] "6,8" "5' 10" "Five foot eight inches" ## [16] "5'5\"" "5'2\"" "5,4" ## [19] "5'3" "5'10''" "5'3''" ## [22] "5'7''" "5'12" "2'33" ## [25] "5'11" "5'3\"" "5,8" ## [28] "5'6''" "5'4" "1,70" ## [31] "5'7.5''" "5'7.5''" "5'2\"" ## [34] "5' 7.78\"" "yyy" "5'5" ## [37] "5'8" "5'6" "5 feet 7inches" ## [40] "6*12" "5 .11" "5 11" ## [43] "5'4" "5'8\"" "5'5" ## [46] "5'7" "5'6" "5'11\"" ## [49] "5'7\"" "5'7" "5'8" ## [52] "5' 11\"" "6'1\"" "69\"" ## [55] "5' 7\"" "5'10''" "5'10" ## [58] "5'10" "5ft 9 inches" "5 ft 9 inches" ## [61] "5'2" "5'11" "5'11''" ## [64] "5'8\"" "708,661" "5 feet 6 inches" ## [67] "5'10''" "5'8" "6'3\"" ## [70] "649,606" "728,346" "6 04" ## [73] "5'9" "5'5''" "5'7\"" ## [76] "6'4\"" "5'4" "170 cm" ## [79] "7,283,465" "5'6" "5'6" ``` --- # String Cleaning Workflow Many of these entries have valuable information, so let's try to salvage as much as we can. The general way to proceed is: 1. Identify the .hi-blue[most common patterns] among the problematic entries. -- 2. .hi-pink[Write an algorithm] to correct these. -- 3. .hi-medgrn[Review results] to make sure your algorithm worked correctly. -- 4. Look at the .hi-purple[remaining problematic entries]. Tweak your algorithm or add another one. -- 5. .hi-red[Stop] when all useful information is .hi-red[corrected] (or when MB < MC). What are the .hi-blue[most common patterns?] --- # String Cleaning Workflow Many of these entries have valuable information, so let's try to salvage as much as we can. The general way to proceed is: 1. Identify the .hi-blue[most common patterns] among the problematic entries. 2. .hi-pink[Write an algorithm] to correct these. 3. .hi-medgrn[Review results] to make sure your algorithm worked correctly. 4. Look at the .hi-purple[remaining problematic entries]. Tweak your algorithm or add another one. 5. .hi-red[Stop] when all useful information is .hi-red[corrected] (or when MB < MC). What are the .hi-blue[most common patterns?] * Strings of the form `x'y` or `x'y"` where `x` is feet and `y` is inches. * Strings of the form `x ft y inches`, except that "ft" and "inches" are inconsistent. --- # String Cleaning Workflow Many of these entries have valuable information, so let's try to salvage as much as we can. The general way to proceed is: 1. Identify the .hi-blue[most common patterns] among the problematic entries. 2. .hi-pink[Write an algorithm] to correct these. 3. .hi-medgrn[Review results] to make sure your algorithm worked correctly. 4. Look at the .hi-purple[remaining problematic entries]. Tweak your algorithm or add another one. 5. .hi-red[Stop] when all useful information is .hi-red[corrected] (or when MB < MC). .hi-slate[My suggested approach:] 1. Try to convert everything to the pattern `x y`. 2. `separate` the feet and inches values. 3. Calculate total inches from feet and inches. --- # String Cleaning Functions To implement this, we'll use a subset of .hi-slate[stringr]'s .hi-medgrn[string cleaning functions]: | Function | Description | |:------|:------| | `str_replace(_all)(df, str, pattern, replacement)` | Replace the first (all) matches of `pattern` with `replacement` within the string `str` | | `str_trim(df, str)` | Remove all whitespace from start and end of string `str`| | `str_squish(df, str)` | Remove all whitespace from start and end of string `str` *and* replace all internal whitespace with a single space| --- # 1. Replace Punctuation Start by replacing 3 different punctuation marks with spaces (note we have to .hi-slate[escape] the " with `\"`): ``` r heights2 <- reported_heights %>% * mutate(height_clean = str_replace_all(height, "'", " "), * height_clean = str_replace_all(height_clean, ",", " "), * height_clean = str_replace_all(height_clean, "\"", " ")) heights2$height_clean ``` ``` ## [1] "75" "70" ## [3] "68" "74" ## [5] "61" "65" ## [7] "66" "62" ## [9] "66" "67" ## [11] "72" "6" ## [13] "69" "68" ## [15] "69" "66" ## [17] "75" "64" ## [19] "60" "67" ## [21] "66" "5 4 " ## [23] "70" "73" ## [25] "72" "69" ## [27] "69" "72" ## [29] "64" "72" ## [31] "75" "71" ## [33] "67" "66" ## [35] "67" "69" ## [37] "68" "66.75" ## [39] "72" "5.3" ## [41] "69" "68" ## [43] "63" "60" ## [45] "73" "74" ## [47] "74" "66" ## [49] "68" "73" ## [51] "70" "68" ## [53] "73" "70.5" ## [55] "165cm" "71" ## [57] "70" "67" ## [59] "69" "67" ## [61] "69" "73" ## [63] "74" "70" ## [65] "66" "511" ## [67] "72" "65" ## [69] "65" "70" ## [71] "73" "67" ## [73] "72" "68" ## [75] "68" "65" ## [77] "72" "71" ## [79] "65" "72" ## [81] "69" "70" ## [83] "72" "6" ## [85] "62" "65" ## [87] "70" "60" ## [89] "67" "62" ## [91] "71" "63" ## [93] "68" "64.1732" ## [95] "64" "71" ## [97] "68.5" "62" ## [99] "2" "70" ## [101] "72" "71" ## [103] "71" "69.6" ## [105] "65" "70" ## [107] "61" "63" ## [109] "70" "5 7" ## [111] "62" "68" ## [113] "73" "66.5" ## [115] "69" "74" ## [117] "71.5" "76" ## [119] "69" "74" ## [121] "74.5" "69" ## [123] "66" "64" ## [125] "78" ">9000" ## [127] "5 7 " "69" ## [129] "67" "63" ## [131] "74" "62" ## [133] "69" "64" ## [135] "71" "62.5" ## [137] "68" "67" ## [139] "71" "74" ## [141] "75" "65" ## [143] "68" "65" ## [145] "66" "72" ## [147] "73" "71" ## [149] "74" "5 3 " ## [151] "73" "68" ## [153] "77" "70.5" ## [155] "63" "69" ## [157] "69" "68.89" ## [159] "66.5" "64.173" ## [161] "63" "65" ## [163] "64" "63" ## [165] "63" "69" ## [167] "69" "64" ## [169] "62" "70" ## [171] "70" "59" ## [173] "65" "67.7" ## [175] "72" "74" ## [177] "71.7" "70.87" ## [179] "66" "72" ## [181] "74" "69" ## [183] "71" "70" ## [185] "70" "64" ## [187] "5 feet and 8.11 inches" "68" ## [189] "66" "64" ## [191] "67" "65" ## [193] "72" "5.25" ## [195] "70" "64.57" ## [197] "51" "63" ## [199] "70" "68" ## [201] "67" "5 11" ## [203] "69" "63" ## [205] "71" "70" ## [207] "64" "70" ## [209] "68" "66" ## [211] "69" "67" ## [213] "65" "72" ## [215] "72" "70" ## [217] "75" "64" ## [219] "66" "68" ## [221] "69" "73" ## [223] "66" "71" ## [225] "67" "70" ## [227] "67" "61" ## [229] "64" "64" ## [231] "5.5" "72" ## [233] "70" "65" ## [235] "11111" "5 9 " ## [237] "68" "67" ## [239] "59" "59" ## [241] "6" "6.5" ## [243] "72" "150" ## [245] "5 10 " "69" ## [247] "72" "69" ## [249] "68.11" "68.11" ## [251] "68" "103.2" ## [253] "65" "67" ## [255] "67" "68" ## [257] "69" "80" ## [259] "5.8" "68" ## [261] "19" "5" ## [263] "70" "60" ## [265] "60" "64.96" ## [267] "5.6" "175" ## [269] "177" "300" ## [271] "68" "70" ## [273] "66" "66" ## [275] "5 3" "72" ## [277] "68" "6 " ## [279] "72.05" "71" ## [281] "72.5" "70" ## [283] "72" "6" ## [285] "68" "70" ## [287] "5.9" "6 8" ## [289] "68" "66" ## [291] "5 10" "5.5" ## [293] "178" "75" ## [295] "66" "69" ## [297] "53" "71" ## [299] "70" "72" ## [301] "68.5" "68" ## [303] "163" "68" ## [305] "59.0551" "66.92" ## [307] "72" "72" ## [309] "6.2" "175" ## [311] "Five foot eight inches" "70" ## [313] "69" "61" ## [315] "6.2" "65" ## [317] "72" "66" ## [319] "5.8" "5.1" ## [321] "70" "65" ## [323] "68" "178" ## [325] "70" "71" ## [327] "74" "165" ## [329] "71" "74" ## [331] "64" "5.11" ## [333] "66" "5 5 " ## [335] "165" "74" ## [337] "75" "180" ## [339] "68" "5 2 " ## [341] "74.8" "69" ## [343] "68.4" "5.75" ## [345] "67" "65" ## [347] "68" "67" ## [349] "69.3" "169" ## [351] "67" "67" ## [353] "73" "75" ## [355] "70" "71" ## [357] "5 4" "72" ## [359] "70" "72" ## [361] "71.5" "72" ## [363] "7" "68.8976" ## [365] "68" "5.4" ## [367] "69" "70" ## [369] "75" "67" ## [371] "157" "6.1" ## [373] "169" "5 3" ## [375] "5.6" "214" ## [377] "183" "67.72" ## [379] "65" "69" ## [381] "71" "5.6" ## [383] "72" "72.4" ## [385] "69" "6" ## [387] "79.05" "162" ## [389] "178" "180" ## [391] "66.4" "72" ## [393] "5 10 " "66" ## [395] "69.29" "170" ## [397] "5 3 " "66.1416" ## [399] "78" "75" ## [401] "178" "70" ## [403] "66" "65" ## [405] "71" "0.7" ## [407] "68.5" "65" ## [409] "69" "190" ## [411] "70" "66" ## [413] "66" "63" ## [415] "71" "5.4" ## [417] "70" "72" ## [419] "184" "5 7 " ## [421] "68.5" "68" ## [423] "5.9" "70" ## [425] "68.5" "5 12" ## [427] "53.77" "66.92" ## [429] "73" "68" ## [431] "72" "70" ## [433] "5.6" "71.5" ## [435] "5.6" "184" ## [437] "72" "67" ## [439] "6" "167" ## [441] "72" "74" ## [443] "72" "72" ## [445] "67.7165" "71" ## [447] "65" "2 33" ## [449] "72" "68" ## [451] "5 11" "69" ## [453] "5 3 " "71" ## [455] "5.5" "5.2" ## [457] "67" "180" ## [459] "54" "5.5" ## [461] "5.5" "6.5" ## [463] "67" "68.1102" ## [465] "69" "5 8" ## [467] "62" "72" ## [469] "70" "71" ## [471] "64.961" "64" ## [473] "67" "180" ## [475] "66" "183" ## [477] "170" "5 6 " ## [479] "66" "70" ## [481] "66" "64" ## [483] "72" "70" ## [485] "172" "75" ## [487] "72" "72" ## [489] "612" "68.5" ## [491] "66" "5.11" ## [493] "71" "168" ## [495] "72" "5 4" ## [497] "70" "67" ## [499] "70" "70" ## [501] "72" "63" ## [503] "65" "63" ## [505] "1 70" "70" ## [507] "70.8" "172" ## [509] "87" "68" ## [511] "66" "77.1654" ## [513] "5.5" "176" ## [515] "5 7.5 " "5 7.5 " ## [517] "68" "66" ## [519] "74" "72" ## [521] "72.83" "77" ## [523] "111" "67" ## [525] "70" "64" ## [527] "65" "73" ## [529] "5 2 " "66.7" ## [531] "69" "67" ## [533] "173" "69" ## [535] "174" "176" ## [537] "175" "71" ## [539] "64" "67" ## [541] "70" "68" ## [543] "5 7.78 " "68.5" ## [545] "68" "67.71" ## [547] "66.5" "69" ## [549] "6.7" "64" ## [551] "60" "12" ## [553] "6" "5.1" ## [555] "74" "5.6" ## [557] "5.5" "67" ## [559] "68.8" "69" ## [561] "78" "70" ## [563] "66" "66" ## [565] "74" "71" ## [567] "yyy" "73" ## [569] "67" "72" ## [571] "70" "70.1" ## [573] "5.2" "69" ## [575] "5 5" "5 8" ## [577] "72" "72" ## [579] "72" "63" ## [581] "5 6" "67" ## [583] "73.2" "64" ## [585] "74" "73.62" ## [587] "72" "68.4" ## [589] "69" "61" ## [591] "5 feet 7inches" "89" ## [593] "69" "5.6" ## [595] "68" "65" ## [597] "68.9" "67.71" ## [599] "67" "67" ## [601] "59" "59" ## [603] "5.7" "183" ## [605] "71" "172" ## [607] "68" "66.9" ## [609] "34" "68.5" ## [611] "74" "25" ## [613] "76" "60" ## [615] "69" "6" ## [617] "69" "67" ## [619] "70" "68.5" ## [621] "70" "67" ## [623] "70" "69" ## [625] "65" "68.9" ## [627] "5.9" "168" ## [629] "66.9291" "70" ## [631] "6.5" "70.85" ## [633] "170" "175" ## [635] "64" "69" ## [637] "67" "72" ## [639] "6" "22" ## [641] "67" "72" ## [643] "72" "72.83" ## [645] "70" "63" ## [647] "71" "72" ## [649] "72" "72.44" ## [651] "70" "68" ## [653] "59" "5.11" ## [655] "684" "61.32" ## [657] "6" "1" ## [659] "68" "73" ## [661] "71" "69" ## [663] "71" "66.93" ## [665] "1" "58" ## [667] "55" "70" ## [669] "6*12" "66" ## [671] "65" "72" ## [673] "74" "5 .11" ## [675] "69" "87" ## [677] "162" "71" ## [679] "72" "68" ## [681] "69" "70" ## [683] "73" "165" ## [685] "184" "73.22" ## [687] "64" "70" ## [689] "75" "63" ## [691] "72.05" "6" ## [693] "68" "75" ## [695] "68" "173" ## [697] "70" "66.1417" ## [699] "74" "1.6" ## [701] "68" "68" ## [703] "70" "71" ## [705] "172" "170" ## [707] "70" "70.08" ## [709] "67.2" "69" ## [711] "70" "69" ## [713] "72.44" "69" ## [715] "72.45" "69" ## [717] "67" "75.98" ## [719] "71" "67" ## [721] "71" "5.7" ## [723] "5.5" "62" ## [725] "71" "74" ## [727] "69" "72" ## [729] "174" "170" ## [731] "68" "66.9291" ## [733] "63" "160" ## [735] "120" "120" ## [737] "23" "70" ## [739] "192" "68" ## [741] "69" "5 11" ## [743] "71" "69.3" ## [745] "167" "70.866" ## [747] "67" "67.72" ## [749] "70" "67" ## [751] "65" "67" ## [753] "62.4" "71" ## [755] "150" "72" ## [757] "68" "75" ## [759] "72" "66" ## [761] "71" "76" ## [763] "69" "69" ## [765] "1.7" "66" ## [767] "174" "69.29" ## [769] "72" "5.8" ## [771] "64" "66" ## [773] "70" "67" ## [775] "75.6" "71.5" ## [777] "6" "63" ## [779] "62" "69" ## [781] "68" "71.65" ## [783] "75" "68.8976" ## [785] "74" "65" ## [787] "5 4" "5 8 " ## [789] "5 5" "64" ## [791] "67" "69" ## [793] "5.8" "67" ## [795] "68.8976" "62.6" ## [797] "70" "69" ## [799] "65" "63" ## [801] "67.3" "64.2" ## [803] "63" "5.1" ## [805] "67" "5.11" ## [807] "61" "61" ## [809] "68.11" "66.14" ## [811] "67.5" "60" ## [813] "5.7" "67.72" ## [815] "66" "5 7" ## [817] "70" "5 6" ## [819] "64.5" "68" ## [821] "69" "68" ## [823] "5 11 " "76" ## [825] "70.8661" "66" ## [827] "67" "71" ## [829] "62.5" "5 7 " ## [831] "5 7" "172" ## [833] "60" "67" ## [835] "5 8" "75" ## [837] "65" "65" ## [839] "64" "62" ## [841] "180" "75" ## [843] "5 11 " "67" ## [845] "72" "5" ## [847] "68" "64" ## [849] "180" "180" ## [851] "6 1 " "75.4" ## [853] "68" "70" ## [855] "5.9" "63" ## [857] "5.2" "73" ## [859] "72" "68" ## [861] "61" "60" ## [863] "5.5" "63" ## [865] "67" "69 " ## [867] "74" "68" ## [869] "70" "73" ## [871] "70" "60" ## [873] "66" "68" ## [875] "5 7 " "65" ## [877] "71" "5 10 " ## [879] "65" "72.8346" ## [881] "5.51" "5 10" ## [883] "70" "65" ## [885] "5 10" "70" ## [887] "67.7" "62" ## [889] "5ft 9 inches" "5 ft 9 inches" ## [891] "5 2" "74" ## [893] "5 11" "70" ## [895] "64" "73" ## [897] "70" "66" ## [899] "66" "5.8" ## [901] "62" "66" ## [903] "64" "67" ## [905] "71" "78" ## [907] "69" "5.7" ## [909] "70" "65" ## [911] "70" "167" ## [913] "71" "68" ## [915] "67" "168" ## [917] "64" "68" ## [919] "60" "64" ## [921] "73.2" "6" ## [923] "6.1" "71" ## [925] "72.44" "67" ## [927] "71" "5 11 " ## [929] "68.89" "68.89" ## [931] "5.69" "178" ## [933] "182" "69" ## [935] "164" "66" ## [937] "68" "70" ## [939] "70" "5 8 " ## [941] "70.1" "185" ## [943] "6" "86" ## [945] "72" "66" ## [947] "66" "69" ## [949] "66" "78.74" ## [951] "5.7" "70" ## [953] "67" "66" ## [955] "70" "708 661" ## [957] "5.25" "72.8346" ## [959] "67.7" "72" ## [961] "69" "5.5" ## [963] "5 feet 6 inches" "5 10 " ## [965] "172" "67" ## [967] "6" "5 8" ## [969] "69" "64.9" ## [971] "160" "6 3 " ## [973] "69" "64.96" ## [975] "649 606" "10000" ## [977] "64.5" "64" ## [979] "65" "70" ## [981] "5.1" "67" ## [983] "152" "1" ## [985] "67.5" "180" ## [987] "70.86" "728 346" ## [989] "175" "70" ## [991] "158" "173" ## [993] "164" "6 04" ## [995] "169" "0" ## [997] "70" "185" ## [999] "168" "5 9" ## [1001] "169" "5 5 " ## [1003] "68" "174" ## [1005] "68" "68.11024" ## [1007] "6.3" "69" ## [1009] "179" "66" ## [1011] "69" "5 7 " ## [1013] "68.89" "73.22" ## [1015] "66" "5.5" ## [1017] "6" "75" ## [1019] "6" "170" ## [1021] "52" "52" ## [1023] "79" "70" ## [1025] "6" "172" ## [1027] "64" "68.11" ## [1029] "67" "158" ## [1031] "100" "75" ## [1033] "75" "81" ## [1035] "69" "68.8976378" ## [1037] "159" "76" ## [1039] "190" "69.6" ## [1041] "5.7" "63" ## [1043] "70" "72.44" ## [1045] "66" "170" ## [1047] "158" "73" ## [1049] "72" "60" ## [1051] "6 4 " "65" ## [1053] "66" "180" ## [1055] "68.9" "5.57" ## [1057] "71" "5 4" ## [1059] "210" "72" ## [1061] "68" "67" ## [1063] "88" "69" ## [1065] "64.2" "6" ## [1067] "63" "162" ## [1069] "66" "170 cm" ## [1071] "74" "72" ## [1073] "5.7" "71" ## [1075] "50" "170" ## [1077] "157" "186" ## [1079] "65" "170" ## [1081] "63" "7 283 465" ## [1083] "74" "67.7165" ## [1085] "5" "5" ## [1087] "34" "69" ## [1089] "69" "50" ## [1091] "69" "69" ## [1093] "161" "5 6" ## [1095] "5 6" ``` --- # 1. Replace Punctuation We can make this more concise with the .hi-medgrn["or" operator (`|`)]: * Rather than three iterations, write one call that replaces all matches of ' or " or , with " " with a single argument ``` r heights2 <- reported_heights %>% * mutate(height_clean = str_replace_all(height, "'|,|\"|,", " ")) heights2$height_clean ``` ``` ## [1] "75" "70" ## [3] "68" "74" ## [5] "61" "65" ## [7] "66" "62" ## [9] "66" "67" ## [11] "72" "6" ## [13] "69" "68" ## [15] "69" "66" ## [17] "75" "64" ## [19] "60" "67" ## [21] "66" "5 4 " ## [23] "70" "73" ## [25] "72" "69" ## [27] "69" "72" ## [29] "64" "72" ## [31] "75" "71" ## [33] "67" "66" ## [35] "67" "69" ## [37] "68" "66.75" ## [39] "72" "5.3" ## [41] "69" "68" ## [43] "63" "60" ## [45] "73" "74" ## [47] "74" "66" ## [49] "68" "73" ## [51] "70" "68" ## [53] "73" "70.5" ## [55] "165cm" "71" ## [57] "70" "67" ## [59] "69" "67" ## [61] "69" "73" ## [63] "74" "70" ## [65] "66" "511" ## [67] "72" "65" ## [69] "65" "70" ## [71] "73" "67" ## [73] "72" "68" ## [75] "68" "65" ## [77] "72" "71" ## [79] "65" "72" ## [81] "69" "70" ## [83] "72" "6" ## [85] "62" "65" ## [87] "70" "60" ## [89] "67" "62" ## [91] "71" "63" ## [93] "68" "64.1732" ## [95] "64" "71" ## [97] "68.5" "62" ## [99] "2" "70" ## [101] "72" "71" ## [103] "71" "69.6" ## [105] "65" "70" ## [107] "61" "63" ## [109] "70" "5 7" ## [111] "62" "68" ## [113] "73" "66.5" ## [115] "69" "74" ## [117] "71.5" "76" ## [119] "69" "74" ## [121] "74.5" "69" ## [123] "66" "64" ## [125] "78" ">9000" ## [127] "5 7 " "69" ## [129] "67" "63" ## [131] "74" "62" ## [133] "69" "64" ## [135] "71" "62.5" ## [137] "68" "67" ## [139] "71" "74" ## [141] "75" "65" ## [143] "68" "65" ## [145] "66" "72" ## [147] "73" "71" ## [149] "74" "5 3 " ## [151] "73" "68" ## [153] "77" "70.5" ## [155] "63" "69" ## [157] "69" "68.89" ## [159] "66.5" "64.173" ## [161] "63" "65" ## [163] "64" "63" ## [165] "63" "69" ## [167] "69" "64" ## [169] "62" "70" ## [171] "70" "59" ## [173] "65" "67.7" ## [175] "72" "74" ## [177] "71.7" "70.87" ## [179] "66" "72" ## [181] "74" "69" ## [183] "71" "70" ## [185] "70" "64" ## [187] "5 feet and 8.11 inches" "68" ## [189] "66" "64" ## [191] "67" "65" ## [193] "72" "5.25" ## [195] "70" "64.57" ## [197] "51" "63" ## [199] "70" "68" ## [201] "67" "5 11" ## [203] "69" "63" ## [205] "71" "70" ## [207] "64" "70" ## [209] "68" "66" ## [211] "69" "67" ## [213] "65" "72" ## [215] "72" "70" ## [217] "75" "64" ## [219] "66" "68" ## [221] "69" "73" ## [223] "66" "71" ## [225] "67" "70" ## [227] "67" "61" ## [229] "64" "64" ## [231] "5.5" "72" ## [233] "70" "65" ## [235] "11111" "5 9 " ## [237] "68" "67" ## [239] "59" "59" ## [241] "6" "6.5" ## [243] "72" "150" ## [245] "5 10 " "69" ## [247] "72" "69" ## [249] "68.11" "68.11" ## [251] "68" "103.2" ## [253] "65" "67" ## [255] "67" "68" ## [257] "69" "80" ## [259] "5.8" "68" ## [261] "19" "5" ## [263] "70" "60" ## [265] "60" "64.96" ## [267] "5.6" "175" ## [269] "177" "300" ## [271] "68" "70" ## [273] "66" "66" ## [275] "5 3" "72" ## [277] "68" "6 " ## [279] "72.05" "71" ## [281] "72.5" "70" ## [283] "72" "6" ## [285] "68" "70" ## [287] "5.9" "6 8" ## [289] "68" "66" ## [291] "5 10" "5.5" ## [293] "178" "75" ## [295] "66" "69" ## [297] "53" "71" ## [299] "70" "72" ## [301] "68.5" "68" ## [303] "163" "68" ## [305] "59.0551" "66.92" ## [307] "72" "72" ## [309] "6.2" "175" ## [311] "Five foot eight inches" "70" ## [313] "69" "61" ## [315] "6.2" "65" ## [317] "72" "66" ## [319] "5.8" "5.1" ## [321] "70" "65" ## [323] "68" "178" ## [325] "70" "71" ## [327] "74" "165" ## [329] "71" "74" ## [331] "64" "5.11" ## [333] "66" "5 5 " ## [335] "165" "74" ## [337] "75" "180" ## [339] "68" "5 2 " ## [341] "74.8" "69" ## [343] "68.4" "5.75" ## [345] "67" "65" ## [347] "68" "67" ## [349] "69.3" "169" ## [351] "67" "67" ## [353] "73" "75" ## [355] "70" "71" ## [357] "5 4" "72" ## [359] "70" "72" ## [361] "71.5" "72" ## [363] "7" "68.8976" ## [365] "68" "5.4" ## [367] "69" "70" ## [369] "75" "67" ## [371] "157" "6.1" ## [373] "169" "5 3" ## [375] "5.6" "214" ## [377] "183" "67.72" ## [379] "65" "69" ## [381] "71" "5.6" ## [383] "72" "72.4" ## [385] "69" "6" ## [387] "79.05" "162" ## [389] "178" "180" ## [391] "66.4" "72" ## [393] "5 10 " "66" ## [395] "69.29" "170" ## [397] "5 3 " "66.1416" ## [399] "78" "75" ## [401] "178" "70" ## [403] "66" "65" ## [405] "71" "0.7" ## [407] "68.5" "65" ## [409] "69" "190" ## [411] "70" "66" ## [413] "66" "63" ## [415] "71" "5.4" ## [417] "70" "72" ## [419] "184" "5 7 " ## [421] "68.5" "68" ## [423] "5.9" "70" ## [425] "68.5" "5 12" ## [427] "53.77" "66.92" ## [429] "73" "68" ## [431] "72" "70" ## [433] "5.6" "71.5" ## [435] "5.6" "184" ## [437] "72" "67" ## [439] "6" "167" ## [441] "72" "74" ## [443] "72" "72" ## [445] "67.7165" "71" ## [447] "65" "2 33" ## [449] "72" "68" ## [451] "5 11" "69" ## [453] "5 3 " "71" ## [455] "5.5" "5.2" ## [457] "67" "180" ## [459] "54" "5.5" ## [461] "5.5" "6.5" ## [463] "67" "68.1102" ## [465] "69" "5 8" ## [467] "62" "72" ## [469] "70" "71" ## [471] "64.961" "64" ## [473] "67" "180" ## [475] "66" "183" ## [477] "170" "5 6 " ## [479] "66" "70" ## [481] "66" "64" ## [483] "72" "70" ## [485] "172" "75" ## [487] "72" "72" ## [489] "612" "68.5" ## [491] "66" "5.11" ## [493] "71" "168" ## [495] "72" "5 4" ## [497] "70" "67" ## [499] "70" "70" ## [501] "72" "63" ## [503] "65" "63" ## [505] "1 70" "70" ## [507] "70.8" "172" ## [509] "87" "68" ## [511] "66" "77.1654" ## [513] "5.5" "176" ## [515] "5 7.5 " "5 7.5 " ## [517] "68" "66" ## [519] "74" "72" ## [521] "72.83" "77" ## [523] "111" "67" ## [525] "70" "64" ## [527] "65" "73" ## [529] "5 2 " "66.7" ## [531] "69" "67" ## [533] "173" "69" ## [535] "174" "176" ## [537] "175" "71" ## [539] "64" "67" ## [541] "70" "68" ## [543] "5 7.78 " "68.5" ## [545] "68" "67.71" ## [547] "66.5" "69" ## [549] "6.7" "64" ## [551] "60" "12" ## [553] "6" "5.1" ## [555] "74" "5.6" ## [557] "5.5" "67" ## [559] "68.8" "69" ## [561] "78" "70" ## [563] "66" "66" ## [565] "74" "71" ## [567] "yyy" "73" ## [569] "67" "72" ## [571] "70" "70.1" ## [573] "5.2" "69" ## [575] "5 5" "5 8" ## [577] "72" "72" ## [579] "72" "63" ## [581] "5 6" "67" ## [583] "73.2" "64" ## [585] "74" "73.62" ## [587] "72" "68.4" ## [589] "69" "61" ## [591] "5 feet 7inches" "89" ## [593] "69" "5.6" ## [595] "68" "65" ## [597] "68.9" "67.71" ## [599] "67" "67" ## [601] "59" "59" ## [603] "5.7" "183" ## [605] "71" "172" ## [607] "68" "66.9" ## [609] "34" "68.5" ## [611] "74" "25" ## [613] "76" "60" ## [615] "69" "6" ## [617] "69" "67" ## [619] "70" "68.5" ## [621] "70" "67" ## [623] "70" "69" ## [625] "65" "68.9" ## [627] "5.9" "168" ## [629] "66.9291" "70" ## [631] "6.5" "70.85" ## [633] "170" "175" ## [635] "64" "69" ## [637] "67" "72" ## [639] "6" "22" ## [641] "67" "72" ## [643] "72" "72.83" ## [645] "70" "63" ## [647] "71" "72" ## [649] "72" "72.44" ## [651] "70" "68" ## [653] "59" "5.11" ## [655] "684" "61.32" ## [657] "6" "1" ## [659] "68" "73" ## [661] "71" "69" ## [663] "71" "66.93" ## [665] "1" "58" ## [667] "55" "70" ## [669] "6*12" "66" ## [671] "65" "72" ## [673] "74" "5 .11" ## [675] "69" "87" ## [677] "162" "71" ## [679] "72" "68" ## [681] "69" "70" ## [683] "73" "165" ## [685] "184" "73.22" ## [687] "64" "70" ## [689] "75" "63" ## [691] "72.05" "6" ## [693] "68" "75" ## [695] "68" "173" ## [697] "70" "66.1417" ## [699] "74" "1.6" ## [701] "68" "68" ## [703] "70" "71" ## [705] "172" "170" ## [707] "70" "70.08" ## [709] "67.2" "69" ## [711] "70" "69" ## [713] "72.44" "69" ## [715] "72.45" "69" ## [717] "67" "75.98" ## [719] "71" "67" ## [721] "71" "5.7" ## [723] "5.5" "62" ## [725] "71" "74" ## [727] "69" "72" ## [729] "174" "170" ## [731] "68" "66.9291" ## [733] "63" "160" ## [735] "120" "120" ## [737] "23" "70" ## [739] "192" "68" ## [741] "69" "5 11" ## [743] "71" "69.3" ## [745] "167" "70.866" ## [747] "67" "67.72" ## [749] "70" "67" ## [751] "65" "67" ## [753] "62.4" "71" ## [755] "150" "72" ## [757] "68" "75" ## [759] "72" "66" ## [761] "71" "76" ## [763] "69" "69" ## [765] "1.7" "66" ## [767] "174" "69.29" ## [769] "72" "5.8" ## [771] "64" "66" ## [773] "70" "67" ## [775] "75.6" "71.5" ## [777] "6" "63" ## [779] "62" "69" ## [781] "68" "71.65" ## [783] "75" "68.8976" ## [785] "74" "65" ## [787] "5 4" "5 8 " ## [789] "5 5" "64" ## [791] "67" "69" ## [793] "5.8" "67" ## [795] "68.8976" "62.6" ## [797] "70" "69" ## [799] "65" "63" ## [801] "67.3" "64.2" ## [803] "63" "5.1" ## [805] "67" "5.11" ## [807] "61" "61" ## [809] "68.11" "66.14" ## [811] "67.5" "60" ## [813] "5.7" "67.72" ## [815] "66" "5 7" ## [817] "70" "5 6" ## [819] "64.5" "68" ## [821] "69" "68" ## [823] "5 11 " "76" ## [825] "70.8661" "66" ## [827] "67" "71" ## [829] "62.5" "5 7 " ## [831] "5 7" "172" ## [833] "60" "67" ## [835] "5 8" "75" ## [837] "65" "65" ## [839] "64" "62" ## [841] "180" "75" ## [843] "5 11 " "67" ## [845] "72" "5" ## [847] "68" "64" ## [849] "180" "180" ## [851] "6 1 " "75.4" ## [853] "68" "70" ## [855] "5.9" "63" ## [857] "5.2" "73" ## [859] "72" "68" ## [861] "61" "60" ## [863] "5.5" "63" ## [865] "67" "69 " ## [867] "74" "68" ## [869] "70" "73" ## [871] "70" "60" ## [873] "66" "68" ## [875] "5 7 " "65" ## [877] "71" "5 10 " ## [879] "65" "72.8346" ## [881] "5.51" "5 10" ## [883] "70" "65" ## [885] "5 10" "70" ## [887] "67.7" "62" ## [889] "5ft 9 inches" "5 ft 9 inches" ## [891] "5 2" "74" ## [893] "5 11" "70" ## [895] "64" "73" ## [897] "70" "66" ## [899] "66" "5.8" ## [901] "62" "66" ## [903] "64" "67" ## [905] "71" "78" ## [907] "69" "5.7" ## [909] "70" "65" ## [911] "70" "167" ## [913] "71" "68" ## [915] "67" "168" ## [917] "64" "68" ## [919] "60" "64" ## [921] "73.2" "6" ## [923] "6.1" "71" ## [925] "72.44" "67" ## [927] "71" "5 11 " ## [929] "68.89" "68.89" ## [931] "5.69" "178" ## [933] "182" "69" ## [935] "164" "66" ## [937] "68" "70" ## [939] "70" "5 8 " ## [941] "70.1" "185" ## [943] "6" "86" ## [945] "72" "66" ## [947] "66" "69" ## [949] "66" "78.74" ## [951] "5.7" "70" ## [953] "67" "66" ## [955] "70" "708 661" ## [957] "5.25" "72.8346" ## [959] "67.7" "72" ## [961] "69" "5.5" ## [963] "5 feet 6 inches" "5 10 " ## [965] "172" "67" ## [967] "6" "5 8" ## [969] "69" "64.9" ## [971] "160" "6 3 " ## [973] "69" "64.96" ## [975] "649 606" "10000" ## [977] "64.5" "64" ## [979] "65" "70" ## [981] "5.1" "67" ## [983] "152" "1" ## [985] "67.5" "180" ## [987] "70.86" "728 346" ## [989] "175" "70" ## [991] "158" "173" ## [993] "164" "6 04" ## [995] "169" "0" ## [997] "70" "185" ## [999] "168" "5 9" ## [1001] "169" "5 5 " ## [1003] "68" "174" ## [1005] "68" "68.11024" ## [1007] "6.3" "69" ## [1009] "179" "66" ## [1011] "69" "5 7 " ## [1013] "68.89" "73.22" ## [1015] "66" "5.5" ## [1017] "6" "75" ## [1019] "6" "170" ## [1021] "52" "52" ## [1023] "79" "70" ## [1025] "6" "172" ## [1027] "64" "68.11" ## [1029] "67" "158" ## [1031] "100" "75" ## [1033] "75" "81" ## [1035] "69" "68.8976378" ## [1037] "159" "76" ## [1039] "190" "69.6" ## [1041] "5.7" "63" ## [1043] "70" "72.44" ## [1045] "66" "170" ## [1047] "158" "73" ## [1049] "72" "60" ## [1051] "6 4 " "65" ## [1053] "66" "180" ## [1055] "68.9" "5.57" ## [1057] "71" "5 4" ## [1059] "210" "72" ## [1061] "68" "67" ## [1063] "88" "69" ## [1065] "64.2" "6" ## [1067] "63" "162" ## [1069] "66" "170 cm" ## [1071] "74" "72" ## [1073] "5.7" "71" ## [1075] "50" "170" ## [1077] "157" "186" ## [1079] "65" "170" ## [1081] "63" "7 283 465" ## [1083] "74" "67.7165" ## [1085] "5" "5" ## [1087] "34" "69" ## [1089] "69" "50" ## [1091] "69" "69" ## [1093] "161" "5 6" ## [1095] "5 6" ``` --- # 2. Remove Common Words + Extra Space Next, get rid of some common words and .hi-blue[trim extra spaces]: ``` r heights2 <- reported_heights %>% mutate(height_clean = str_replace_all(height, "'|,|\"|,|ft|feet|inches|and|cm", " "), * height_clean = str_trim(height_clean)) heights2$height_clean ``` ``` ## [1] "75" "70" "68" "74" ## [5] "61" "65" "66" "62" ## [9] "66" "67" "72" "6" ## [13] "69" "68" "69" "66" ## [17] "75" "64" "60" "67" ## [21] "66" "5 4" "70" "73" ## [25] "72" "69" "69" "72" ## [29] "64" "72" "75" "71" ## [33] "67" "66" "67" "69" ## [37] "68" "66.75" "72" "5.3" ## [41] "69" "68" "63" "60" ## [45] "73" "74" "74" "66" ## [49] "68" "73" "70" "68" ## [53] "73" "70.5" "165" "71" ## [57] "70" "67" "69" "67" ## [61] "69" "73" "74" "70" ## [65] "66" "511" "72" "65" ## [69] "65" "70" "73" "67" ## [73] "72" "68" "68" "65" ## [77] "72" "71" "65" "72" ## [81] "69" "70" "72" "6" ## [85] "62" "65" "70" "60" ## [89] "67" "62" "71" "63" ## [93] "68" "64.1732" "64" "71" ## [97] "68.5" "62" "2" "70" ## [101] "72" "71" "71" "69.6" ## [105] "65" "70" "61" "63" ## [109] "70" "5 7" "62" "68" ## [113] "73" "66.5" "69" "74" ## [117] "71.5" "76" "69" "74" ## [121] "74.5" "69" "66" "64" ## [125] "78" ">9000" "5 7" "69" ## [129] "67" "63" "74" "62" ## [133] "69" "64" "71" "62.5" ## [137] "68" "67" "71" "74" ## [141] "75" "65" "68" "65" ## [145] "66" "72" "73" "71" ## [149] "74" "5 3" "73" "68" ## [153] "77" "70.5" "63" "69" ## [157] "69" "68.89" "66.5" "64.173" ## [161] "63" "65" "64" "63" ## [165] "63" "69" "69" "64" ## [169] "62" "70" "70" "59" ## [173] "65" "67.7" "72" "74" ## [177] "71.7" "70.87" "66" "72" ## [181] "74" "69" "71" "70" ## [185] "70" "64" "5 8.11" "68" ## [189] "66" "64" "67" "65" ## [193] "72" "5.25" "70" "64.57" ## [197] "51" "63" "70" "68" ## [201] "67" "5 11" "69" "63" ## [205] "71" "70" "64" "70" ## [209] "68" "66" "69" "67" ## [213] "65" "72" "72" "70" ## [217] "75" "64" "66" "68" ## [221] "69" "73" "66" "71" ## [225] "67" "70" "67" "61" ## [229] "64" "64" "5.5" "72" ## [233] "70" "65" "11111" "5 9" ## [237] "68" "67" "59" "59" ## [241] "6" "6.5" "72" "150" ## [245] "5 10" "69" "72" "69" ## [249] "68.11" "68.11" "68" "103.2" ## [253] "65" "67" "67" "68" ## [257] "69" "80" "5.8" "68" ## [261] "19" "5" "70" "60" ## [265] "60" "64.96" "5.6" "175" ## [269] "177" "300" "68" "70" ## [273] "66" "66" "5 3" "72" ## [277] "68" "6" "72.05" "71" ## [281] "72.5" "70" "72" "6" ## [285] "68" "70" "5.9" "6 8" ## [289] "68" "66" "5 10" "5.5" ## [293] "178" "75" "66" "69" ## [297] "53" "71" "70" "72" ## [301] "68.5" "68" "163" "68" ## [305] "59.0551" "66.92" "72" "72" ## [309] "6.2" "175" "Five foot eight" "70" ## [313] "69" "61" "6.2" "65" ## [317] "72" "66" "5.8" "5.1" ## [321] "70" "65" "68" "178" ## [325] "70" "71" "74" "165" ## [329] "71" "74" "64" "5.11" ## [333] "66" "5 5" "165" "74" ## [337] "75" "180" "68" "5 2" ## [341] "74.8" "69" "68.4" "5.75" ## [345] "67" "65" "68" "67" ## [349] "69.3" "169" "67" "67" ## [353] "73" "75" "70" "71" ## [357] "5 4" "72" "70" "72" ## [361] "71.5" "72" "7" "68.8976" ## [365] "68" "5.4" "69" "70" ## [369] "75" "67" "157" "6.1" ## [373] "169" "5 3" "5.6" "214" ## [377] "183" "67.72" "65" "69" ## [381] "71" "5.6" "72" "72.4" ## [385] "69" "6" "79.05" "162" ## [389] "178" "180" "66.4" "72" ## [393] "5 10" "66" "69.29" "170" ## [397] "5 3" "66.1416" "78" "75" ## [401] "178" "70" "66" "65" ## [405] "71" "0.7" "68.5" "65" ## [409] "69" "190" "70" "66" ## [413] "66" "63" "71" "5.4" ## [417] "70" "72" "184" "5 7" ## [421] "68.5" "68" "5.9" "70" ## [425] "68.5" "5 12" "53.77" "66.92" ## [429] "73" "68" "72" "70" ## [433] "5.6" "71.5" "5.6" "184" ## [437] "72" "67" "6" "167" ## [441] "72" "74" "72" "72" ## [445] "67.7165" "71" "65" "2 33" ## [449] "72" "68" "5 11" "69" ## [453] "5 3" "71" "5.5" "5.2" ## [457] "67" "180" "54" "5.5" ## [461] "5.5" "6.5" "67" "68.1102" ## [465] "69" "5 8" "62" "72" ## [469] "70" "71" "64.961" "64" ## [473] "67" "180" "66" "183" ## [477] "170" "5 6" "66" "70" ## [481] "66" "64" "72" "70" ## [485] "172" "75" "72" "72" ## [489] "612" "68.5" "66" "5.11" ## [493] "71" "168" "72" "5 4" ## [497] "70" "67" "70" "70" ## [501] "72" "63" "65" "63" ## [505] "1 70" "70" "70.8" "172" ## [509] "87" "68" "66" "77.1654" ## [513] "5.5" "176" "5 7.5" "5 7.5" ## [517] "68" "66" "74" "72" ## [521] "72.83" "77" "111" "67" ## [525] "70" "64" "65" "73" ## [529] "5 2" "66.7" "69" "67" ## [533] "173" "69" "174" "176" ## [537] "175" "71" "64" "67" ## [541] "70" "68" "5 7.78" "68.5" ## [545] "68" "67.71" "66.5" "69" ## [549] "6.7" "64" "60" "12" ## [553] "6" "5.1" "74" "5.6" ## [557] "5.5" "67" "68.8" "69" ## [561] "78" "70" "66" "66" ## [565] "74" "71" "yyy" "73" ## [569] "67" "72" "70" "70.1" ## [573] "5.2" "69" "5 5" "5 8" ## [577] "72" "72" "72" "63" ## [581] "5 6" "67" "73.2" "64" ## [585] "74" "73.62" "72" "68.4" ## [589] "69" "61" "5 7" "89" ## [593] "69" "5.6" "68" "65" ## [597] "68.9" "67.71" "67" "67" ## [601] "59" "59" "5.7" "183" ## [605] "71" "172" "68" "66.9" ## [609] "34" "68.5" "74" "25" ## [613] "76" "60" "69" "6" ## [617] "69" "67" "70" "68.5" ## [621] "70" "67" "70" "69" ## [625] "65" "68.9" "5.9" "168" ## [629] "66.9291" "70" "6.5" "70.85" ## [633] "170" "175" "64" "69" ## [637] "67" "72" "6" "22" ## [641] "67" "72" "72" "72.83" ## [645] "70" "63" "71" "72" ## [649] "72" "72.44" "70" "68" ## [653] "59" "5.11" "684" "61.32" ## [657] "6" "1" "68" "73" ## [661] "71" "69" "71" "66.93" ## [665] "1" "58" "55" "70" ## [669] "6*12" "66" "65" "72" ## [673] "74" "5 .11" "69" "87" ## [677] "162" "71" "72" "68" ## [681] "69" "70" "73" "165" ## [685] "184" "73.22" "64" "70" ## [689] "75" "63" "72.05" "6" ## [693] "68" "75" "68" "173" ## [697] "70" "66.1417" "74" "1.6" ## [701] "68" "68" "70" "71" ## [705] "172" "170" "70" "70.08" ## [709] "67.2" "69" "70" "69" ## [713] "72.44" "69" "72.45" "69" ## [717] "67" "75.98" "71" "67" ## [721] "71" "5.7" "5.5" "62" ## [725] "71" "74" "69" "72" ## [729] "174" "170" "68" "66.9291" ## [733] "63" "160" "120" "120" ## [737] "23" "70" "192" "68" ## [741] "69" "5 11" "71" "69.3" ## [745] "167" "70.866" "67" "67.72" ## [749] "70" "67" "65" "67" ## [753] "62.4" "71" "150" "72" ## [757] "68" "75" "72" "66" ## [761] "71" "76" "69" "69" ## [765] "1.7" "66" "174" "69.29" ## [769] "72" "5.8" "64" "66" ## [773] "70" "67" "75.6" "71.5" ## [777] "6" "63" "62" "69" ## [781] "68" "71.65" "75" "68.8976" ## [785] "74" "65" "5 4" "5 8" ## [789] "5 5" "64" "67" "69" ## [793] "5.8" "67" "68.8976" "62.6" ## [797] "70" "69" "65" "63" ## [801] "67.3" "64.2" "63" "5.1" ## [805] "67" "5.11" "61" "61" ## [809] "68.11" "66.14" "67.5" "60" ## [813] "5.7" "67.72" "66" "5 7" ## [817] "70" "5 6" "64.5" "68" ## [821] "69" "68" "5 11" "76" ## [825] "70.8661" "66" "67" "71" ## [829] "62.5" "5 7" "5 7" "172" ## [833] "60" "67" "5 8" "75" ## [837] "65" "65" "64" "62" ## [841] "180" "75" "5 11" "67" ## [845] "72" "5" "68" "64" ## [849] "180" "180" "6 1" "75.4" ## [853] "68" "70" "5.9" "63" ## [857] "5.2" "73" "72" "68" ## [861] "61" "60" "5.5" "63" ## [865] "67" "69" "74" "68" ## [869] "70" "73" "70" "60" ## [873] "66" "68" "5 7" "65" ## [877] "71" "5 10" "65" "72.8346" ## [881] "5.51" "5 10" "70" "65" ## [885] "5 10" "70" "67.7" "62" ## [889] "5 9" "5 9" "5 2" "74" ## [893] "5 11" "70" "64" "73" ## [897] "70" "66" "66" "5.8" ## [901] "62" "66" "64" "67" ## [905] "71" "78" "69" "5.7" ## [909] "70" "65" "70" "167" ## [913] "71" "68" "67" "168" ## [917] "64" "68" "60" "64" ## [921] "73.2" "6" "6.1" "71" ## [925] "72.44" "67" "71" "5 11" ## [929] "68.89" "68.89" "5.69" "178" ## [933] "182" "69" "164" "66" ## [937] "68" "70" "70" "5 8" ## [941] "70.1" "185" "6" "86" ## [945] "72" "66" "66" "69" ## [949] "66" "78.74" "5.7" "70" ## [953] "67" "66" "70" "708 661" ## [957] "5.25" "72.8346" "67.7" "72" ## [961] "69" "5.5" "5 6" "5 10" ## [965] "172" "67" "6" "5 8" ## [969] "69" "64.9" "160" "6 3" ## [973] "69" "64.96" "649 606" "10000" ## [977] "64.5" "64" "65" "70" ## [981] "5.1" "67" "152" "1" ## [985] "67.5" "180" "70.86" "728 346" ## [989] "175" "70" "158" "173" ## [993] "164" "6 04" "169" "0" ## [997] "70" "185" "168" "5 9" ## [1001] "169" "5 5" "68" "174" ## [1005] "68" "68.11024" "6.3" "69" ## [1009] "179" "66" "69" "5 7" ## [1013] "68.89" "73.22" "66" "5.5" ## [1017] "6" "75" "6" "170" ## [1021] "52" "52" "79" "70" ## [1025] "6" "172" "64" "68.11" ## [1029] "67" "158" "100" "75" ## [1033] "75" "81" "69" "68.8976378" ## [1037] "159" "76" "190" "69.6" ## [1041] "5.7" "63" "70" "72.44" ## [1045] "66" "170" "158" "73" ## [1049] "72" "60" "6 4" "65" ## [1053] "66" "180" "68.9" "5.57" ## [1057] "71" "5 4" "210" "72" ## [1061] "68" "67" "88" "69" ## [1065] "64.2" "6" "63" "162" ## [1069] "66" "170" "74" "72" ## [1073] "5.7" "71" "50" "170" ## [1077] "157" "186" "65" "170" ## [1081] "63" "7 283 465" "74" "67.7165" ## [1085] "5" "5" "34" "69" ## [1089] "69" "50" "69" "69" ## [1093] "161" "5 6" "5 6" ``` --- # 2. Remove Common Words + Extra Space Additionally remove extra spaces .hi-purple[before/after/within strings]: ``` r heights2 <- reported_heights %>% mutate(height_clean = str_replace_all(height, "'|,|\"|,|ft|feet|inches|and|cm", " "), * height_clean = str_squish(height_clean)) heights2$height_clean ``` ``` ## [1] "75" "70" "68" "74" ## [5] "61" "65" "66" "62" ## [9] "66" "67" "72" "6" ## [13] "69" "68" "69" "66" ## [17] "75" "64" "60" "67" ## [21] "66" "5 4" "70" "73" ## [25] "72" "69" "69" "72" ## [29] "64" "72" "75" "71" ## [33] "67" "66" "67" "69" ## [37] "68" "66.75" "72" "5.3" ## [41] "69" "68" "63" "60" ## [45] "73" "74" "74" "66" ## [49] "68" "73" "70" "68" ## [53] "73" "70.5" "165" "71" ## [57] "70" "67" "69" "67" ## [61] "69" "73" "74" "70" ## [65] "66" "511" "72" "65" ## [69] "65" "70" "73" "67" ## [73] "72" "68" "68" "65" ## [77] "72" "71" "65" "72" ## [81] "69" "70" "72" "6" ## [85] "62" "65" "70" "60" ## [89] "67" "62" "71" "63" ## [93] "68" "64.1732" "64" "71" ## [97] "68.5" "62" "2" "70" ## [101] "72" "71" "71" "69.6" ## [105] "65" "70" "61" "63" ## [109] "70" "5 7" "62" "68" ## [113] "73" "66.5" "69" "74" ## [117] "71.5" "76" "69" "74" ## [121] "74.5" "69" "66" "64" ## [125] "78" ">9000" "5 7" "69" ## [129] "67" "63" "74" "62" ## [133] "69" "64" "71" "62.5" ## [137] "68" "67" "71" "74" ## [141] "75" "65" "68" "65" ## [145] "66" "72" "73" "71" ## [149] "74" "5 3" "73" "68" ## [153] "77" "70.5" "63" "69" ## [157] "69" "68.89" "66.5" "64.173" ## [161] "63" "65" "64" "63" ## [165] "63" "69" "69" "64" ## [169] "62" "70" "70" "59" ## [173] "65" "67.7" "72" "74" ## [177] "71.7" "70.87" "66" "72" ## [181] "74" "69" "71" "70" ## [185] "70" "64" "5 8.11" "68" ## [189] "66" "64" "67" "65" ## [193] "72" "5.25" "70" "64.57" ## [197] "51" "63" "70" "68" ## [201] "67" "5 11" "69" "63" ## [205] "71" "70" "64" "70" ## [209] "68" "66" "69" "67" ## [213] "65" "72" "72" "70" ## [217] "75" "64" "66" "68" ## [221] "69" "73" "66" "71" ## [225] "67" "70" "67" "61" ## [229] "64" "64" "5.5" "72" ## [233] "70" "65" "11111" "5 9" ## [237] "68" "67" "59" "59" ## [241] "6" "6.5" "72" "150" ## [245] "5 10" "69" "72" "69" ## [249] "68.11" "68.11" "68" "103.2" ## [253] "65" "67" "67" "68" ## [257] "69" "80" "5.8" "68" ## [261] "19" "5" "70" "60" ## [265] "60" "64.96" "5.6" "175" ## [269] "177" "300" "68" "70" ## [273] "66" "66" "5 3" "72" ## [277] "68" "6" "72.05" "71" ## [281] "72.5" "70" "72" "6" ## [285] "68" "70" "5.9" "6 8" ## [289] "68" "66" "5 10" "5.5" ## [293] "178" "75" "66" "69" ## [297] "53" "71" "70" "72" ## [301] "68.5" "68" "163" "68" ## [305] "59.0551" "66.92" "72" "72" ## [309] "6.2" "175" "Five foot eight" "70" ## [313] "69" "61" "6.2" "65" ## [317] "72" "66" "5.8" "5.1" ## [321] "70" "65" "68" "178" ## [325] "70" "71" "74" "165" ## [329] "71" "74" "64" "5.11" ## [333] "66" "5 5" "165" "74" ## [337] "75" "180" "68" "5 2" ## [341] "74.8" "69" "68.4" "5.75" ## [345] "67" "65" "68" "67" ## [349] "69.3" "169" "67" "67" ## [353] "73" "75" "70" "71" ## [357] "5 4" "72" "70" "72" ## [361] "71.5" "72" "7" "68.8976" ## [365] "68" "5.4" "69" "70" ## [369] "75" "67" "157" "6.1" ## [373] "169" "5 3" "5.6" "214" ## [377] "183" "67.72" "65" "69" ## [381] "71" "5.6" "72" "72.4" ## [385] "69" "6" "79.05" "162" ## [389] "178" "180" "66.4" "72" ## [393] "5 10" "66" "69.29" "170" ## [397] "5 3" "66.1416" "78" "75" ## [401] "178" "70" "66" "65" ## [405] "71" "0.7" "68.5" "65" ## [409] "69" "190" "70" "66" ## [413] "66" "63" "71" "5.4" ## [417] "70" "72" "184" "5 7" ## [421] "68.5" "68" "5.9" "70" ## [425] "68.5" "5 12" "53.77" "66.92" ## [429] "73" "68" "72" "70" ## [433] "5.6" "71.5" "5.6" "184" ## [437] "72" "67" "6" "167" ## [441] "72" "74" "72" "72" ## [445] "67.7165" "71" "65" "2 33" ## [449] "72" "68" "5 11" "69" ## [453] "5 3" "71" "5.5" "5.2" ## [457] "67" "180" "54" "5.5" ## [461] "5.5" "6.5" "67" "68.1102" ## [465] "69" "5 8" "62" "72" ## [469] "70" "71" "64.961" "64" ## [473] "67" "180" "66" "183" ## [477] "170" "5 6" "66" "70" ## [481] "66" "64" "72" "70" ## [485] "172" "75" "72" "72" ## [489] "612" "68.5" "66" "5.11" ## [493] "71" "168" "72" "5 4" ## [497] "70" "67" "70" "70" ## [501] "72" "63" "65" "63" ## [505] "1 70" "70" "70.8" "172" ## [509] "87" "68" "66" "77.1654" ## [513] "5.5" "176" "5 7.5" "5 7.5" ## [517] "68" "66" "74" "72" ## [521] "72.83" "77" "111" "67" ## [525] "70" "64" "65" "73" ## [529] "5 2" "66.7" "69" "67" ## [533] "173" "69" "174" "176" ## [537] "175" "71" "64" "67" ## [541] "70" "68" "5 7.78" "68.5" ## [545] "68" "67.71" "66.5" "69" ## [549] "6.7" "64" "60" "12" ## [553] "6" "5.1" "74" "5.6" ## [557] "5.5" "67" "68.8" "69" ## [561] "78" "70" "66" "66" ## [565] "74" "71" "yyy" "73" ## [569] "67" "72" "70" "70.1" ## [573] "5.2" "69" "5 5" "5 8" ## [577] "72" "72" "72" "63" ## [581] "5 6" "67" "73.2" "64" ## [585] "74" "73.62" "72" "68.4" ## [589] "69" "61" "5 7" "89" ## [593] "69" "5.6" "68" "65" ## [597] "68.9" "67.71" "67" "67" ## [601] "59" "59" "5.7" "183" ## [605] "71" "172" "68" "66.9" ## [609] "34" "68.5" "74" "25" ## [613] "76" "60" "69" "6" ## [617] "69" "67" "70" "68.5" ## [621] "70" "67" "70" "69" ## [625] "65" "68.9" "5.9" "168" ## [629] "66.9291" "70" "6.5" "70.85" ## [633] "170" "175" "64" "69" ## [637] "67" "72" "6" "22" ## [641] "67" "72" "72" "72.83" ## [645] "70" "63" "71" "72" ## [649] "72" "72.44" "70" "68" ## [653] "59" "5.11" "684" "61.32" ## [657] "6" "1" "68" "73" ## [661] "71" "69" "71" "66.93" ## [665] "1" "58" "55" "70" ## [669] "6*12" "66" "65" "72" ## [673] "74" "5 .11" "69" "87" ## [677] "162" "71" "72" "68" ## [681] "69" "70" "73" "165" ## [685] "184" "73.22" "64" "70" ## [689] "75" "63" "72.05" "6" ## [693] "68" "75" "68" "173" ## [697] "70" "66.1417" "74" "1.6" ## [701] "68" "68" "70" "71" ## [705] "172" "170" "70" "70.08" ## [709] "67.2" "69" "70" "69" ## [713] "72.44" "69" "72.45" "69" ## [717] "67" "75.98" "71" "67" ## [721] "71" "5.7" "5.5" "62" ## [725] "71" "74" "69" "72" ## [729] "174" "170" "68" "66.9291" ## [733] "63" "160" "120" "120" ## [737] "23" "70" "192" "68" ## [741] "69" "5 11" "71" "69.3" ## [745] "167" "70.866" "67" "67.72" ## [749] "70" "67" "65" "67" ## [753] "62.4" "71" "150" "72" ## [757] "68" "75" "72" "66" ## [761] "71" "76" "69" "69" ## [765] "1.7" "66" "174" "69.29" ## [769] "72" "5.8" "64" "66" ## [773] "70" "67" "75.6" "71.5" ## [777] "6" "63" "62" "69" ## [781] "68" "71.65" "75" "68.8976" ## [785] "74" "65" "5 4" "5 8" ## [789] "5 5" "64" "67" "69" ## [793] "5.8" "67" "68.8976" "62.6" ## [797] "70" "69" "65" "63" ## [801] "67.3" "64.2" "63" "5.1" ## [805] "67" "5.11" "61" "61" ## [809] "68.11" "66.14" "67.5" "60" ## [813] "5.7" "67.72" "66" "5 7" ## [817] "70" "5 6" "64.5" "68" ## [821] "69" "68" "5 11" "76" ## [825] "70.8661" "66" "67" "71" ## [829] "62.5" "5 7" "5 7" "172" ## [833] "60" "67" "5 8" "75" ## [837] "65" "65" "64" "62" ## [841] "180" "75" "5 11" "67" ## [845] "72" "5" "68" "64" ## [849] "180" "180" "6 1" "75.4" ## [853] "68" "70" "5.9" "63" ## [857] "5.2" "73" "72" "68" ## [861] "61" "60" "5.5" "63" ## [865] "67" "69" "74" "68" ## [869] "70" "73" "70" "60" ## [873] "66" "68" "5 7" "65" ## [877] "71" "5 10" "65" "72.8346" ## [881] "5.51" "5 10" "70" "65" ## [885] "5 10" "70" "67.7" "62" ## [889] "5 9" "5 9" "5 2" "74" ## [893] "5 11" "70" "64" "73" ## [897] "70" "66" "66" "5.8" ## [901] "62" "66" "64" "67" ## [905] "71" "78" "69" "5.7" ## [909] "70" "65" "70" "167" ## [913] "71" "68" "67" "168" ## [917] "64" "68" "60" "64" ## [921] "73.2" "6" "6.1" "71" ## [925] "72.44" "67" "71" "5 11" ## [929] "68.89" "68.89" "5.69" "178" ## [933] "182" "69" "164" "66" ## [937] "68" "70" "70" "5 8" ## [941] "70.1" "185" "6" "86" ## [945] "72" "66" "66" "69" ## [949] "66" "78.74" "5.7" "70" ## [953] "67" "66" "70" "708 661" ## [957] "5.25" "72.8346" "67.7" "72" ## [961] "69" "5.5" "5 6" "5 10" ## [965] "172" "67" "6" "5 8" ## [969] "69" "64.9" "160" "6 3" ## [973] "69" "64.96" "649 606" "10000" ## [977] "64.5" "64" "65" "70" ## [981] "5.1" "67" "152" "1" ## [985] "67.5" "180" "70.86" "728 346" ## [989] "175" "70" "158" "173" ## [993] "164" "6 04" "169" "0" ## [997] "70" "185" "168" "5 9" ## [1001] "169" "5 5" "68" "174" ## [1005] "68" "68.11024" "6.3" "69" ## [1009] "179" "66" "69" "5 7" ## [1013] "68.89" "73.22" "66" "5.5" ## [1017] "6" "75" "6" "170" ## [1021] "52" "52" "79" "70" ## [1025] "6" "172" "64" "68.11" ## [1029] "67" "158" "100" "75" ## [1033] "75" "81" "69" "68.8976378" ## [1037] "159" "76" "190" "69.6" ## [1041] "5.7" "63" "70" "72.44" ## [1045] "66" "170" "158" "73" ## [1049] "72" "60" "6 4" "65" ## [1053] "66" "180" "68.9" "5.57" ## [1057] "71" "5 4" "210" "72" ## [1061] "68" "67" "88" "69" ## [1065] "64.2" "6" "63" "162" ## [1069] "66" "170" "74" "72" ## [1073] "5.7" "71" "50" "170" ## [1077] "157" "186" "65" "170" ## [1081] "63" "7 283 465" "74" "67.7165" ## [1085] "5" "5" "34" "69" ## [1089] "69" "50" "69" "69" ## [1093] "161" "5 6" "5 6" ``` --- # 3. Remove Punctuation (Periods) ``` r heights2 <- reported_heights %>% mutate(height_clean = str_replace_all(height, "'|,|\"|,|ft|feet|inches|and|cm", " "), height_clean = str_squish(height_clean), * height_clean = str_replace(height_clean, " \\.", " ")) heights2$height_clean ``` ``` ## [1] "75" "70" "68" "74" ## [5] "61" "65" "66" "62" ## [9] "66" "67" "72" "6" ## [13] "69" "68" "69" "66" ## [17] "75" "64" "60" "67" ## [21] "66" "5 4" "70" "73" ## [25] "72" "69" "69" "72" ## [29] "64" "72" "75" "71" ## [33] "67" "66" "67" "69" ## [37] "68" "66.75" "72" "5.3" ## [41] "69" "68" "63" "60" ## [45] "73" "74" "74" "66" ## [49] "68" "73" "70" "68" ## [53] "73" "70.5" "165" "71" ## [57] "70" "67" "69" "67" ## [61] "69" "73" "74" "70" ## [65] "66" "511" "72" "65" ## [69] "65" "70" "73" "67" ## [73] "72" "68" "68" "65" ## [77] "72" "71" "65" "72" ## [81] "69" "70" "72" "6" ## [85] "62" "65" "70" "60" ## [89] "67" "62" "71" "63" ## [93] "68" "64.1732" "64" "71" ## [97] "68.5" "62" "2" "70" ## [101] "72" "71" "71" "69.6" ## [105] "65" "70" "61" "63" ## [109] "70" "5 7" "62" "68" ## [113] "73" "66.5" "69" "74" ## [117] "71.5" "76" "69" "74" ## [121] "74.5" "69" "66" "64" ## [125] "78" ">9000" "5 7" "69" ## [129] "67" "63" "74" "62" ## [133] "69" "64" "71" "62.5" ## [137] "68" "67" "71" "74" ## [141] "75" "65" "68" "65" ## [145] "66" "72" "73" "71" ## [149] "74" "5 3" "73" "68" ## [153] "77" "70.5" "63" "69" ## [157] "69" "68.89" "66.5" "64.173" ## [161] "63" "65" "64" "63" ## [165] "63" "69" "69" "64" ## [169] "62" "70" "70" "59" ## [173] "65" "67.7" "72" "74" ## [177] "71.7" "70.87" "66" "72" ## [181] "74" "69" "71" "70" ## [185] "70" "64" "5 8.11" "68" ## [189] "66" "64" "67" "65" ## [193] "72" "5.25" "70" "64.57" ## [197] "51" "63" "70" "68" ## [201] "67" "5 11" "69" "63" ## [205] "71" "70" "64" "70" ## [209] "68" "66" "69" "67" ## [213] "65" "72" "72" "70" ## [217] "75" "64" "66" "68" ## [221] "69" "73" "66" "71" ## [225] "67" "70" "67" "61" ## [229] "64" "64" "5.5" "72" ## [233] "70" "65" "11111" "5 9" ## [237] "68" "67" "59" "59" ## [241] "6" "6.5" "72" "150" ## [245] "5 10" "69" "72" "69" ## [249] "68.11" "68.11" "68" "103.2" ## [253] "65" "67" "67" "68" ## [257] "69" "80" "5.8" "68" ## [261] "19" "5" "70" "60" ## [265] "60" "64.96" "5.6" "175" ## [269] "177" "300" "68" "70" ## [273] "66" "66" "5 3" "72" ## [277] "68" "6" "72.05" "71" ## [281] "72.5" "70" "72" "6" ## [285] "68" "70" "5.9" "6 8" ## [289] "68" "66" "5 10" "5.5" ## [293] "178" "75" "66" "69" ## [297] "53" "71" "70" "72" ## [301] "68.5" "68" "163" "68" ## [305] "59.0551" "66.92" "72" "72" ## [309] "6.2" "175" "Five foot eight" "70" ## [313] "69" "61" "6.2" "65" ## [317] "72" "66" "5.8" "5.1" ## [321] "70" "65" "68" "178" ## [325] "70" "71" "74" "165" ## [329] "71" "74" "64" "5.11" ## [333] "66" "5 5" "165" "74" ## [337] "75" "180" "68" "5 2" ## [341] "74.8" "69" "68.4" "5.75" ## [345] "67" "65" "68" "67" ## [349] "69.3" "169" "67" "67" ## [353] "73" "75" "70" "71" ## [357] "5 4" "72" "70" "72" ## [361] "71.5" "72" "7" "68.8976" ## [365] "68" "5.4" "69" "70" ## [369] "75" "67" "157" "6.1" ## [373] "169" "5 3" "5.6" "214" ## [377] "183" "67.72" "65" "69" ## [381] "71" "5.6" "72" "72.4" ## [385] "69" "6" "79.05" "162" ## [389] "178" "180" "66.4" "72" ## [393] "5 10" "66" "69.29" "170" ## [397] "5 3" "66.1416" "78" "75" ## [401] "178" "70" "66" "65" ## [405] "71" "0.7" "68.5" "65" ## [409] "69" "190" "70" "66" ## [413] "66" "63" "71" "5.4" ## [417] "70" "72" "184" "5 7" ## [421] "68.5" "68" "5.9" "70" ## [425] "68.5" "5 12" "53.77" "66.92" ## [429] "73" "68" "72" "70" ## [433] "5.6" "71.5" "5.6" "184" ## [437] "72" "67" "6" "167" ## [441] "72" "74" "72" "72" ## [445] "67.7165" "71" "65" "2 33" ## [449] "72" "68" "5 11" "69" ## [453] "5 3" "71" "5.5" "5.2" ## [457] "67" "180" "54" "5.5" ## [461] "5.5" "6.5" "67" "68.1102" ## [465] "69" "5 8" "62" "72" ## [469] "70" "71" "64.961" "64" ## [473] "67" "180" "66" "183" ## [477] "170" "5 6" "66" "70" ## [481] "66" "64" "72" "70" ## [485] "172" "75" "72" "72" ## [489] "612" "68.5" "66" "5.11" ## [493] "71" "168" "72" "5 4" ## [497] "70" "67" "70" "70" ## [501] "72" "63" "65" "63" ## [505] "1 70" "70" "70.8" "172" ## [509] "87" "68" "66" "77.1654" ## [513] "5.5" "176" "5 7.5" "5 7.5" ## [517] "68" "66" "74" "72" ## [521] "72.83" "77" "111" "67" ## [525] "70" "64" "65" "73" ## [529] "5 2" "66.7" "69" "67" ## [533] "173" "69" "174" "176" ## [537] "175" "71" "64" "67" ## [541] "70" "68" "5 7.78" "68.5" ## [545] "68" "67.71" "66.5" "69" ## [549] "6.7" "64" "60" "12" ## [553] "6" "5.1" "74" "5.6" ## [557] "5.5" "67" "68.8" "69" ## [561] "78" "70" "66" "66" ## [565] "74" "71" "yyy" "73" ## [569] "67" "72" "70" "70.1" ## [573] "5.2" "69" "5 5" "5 8" ## [577] "72" "72" "72" "63" ## [581] "5 6" "67" "73.2" "64" ## [585] "74" "73.62" "72" "68.4" ## [589] "69" "61" "5 7" "89" ## [593] "69" "5.6" "68" "65" ## [597] "68.9" "67.71" "67" "67" ## [601] "59" "59" "5.7" "183" ## [605] "71" "172" "68" "66.9" ## [609] "34" "68.5" "74" "25" ## [613] "76" "60" "69" "6" ## [617] "69" "67" "70" "68.5" ## [621] "70" "67" "70" "69" ## [625] "65" "68.9" "5.9" "168" ## [629] "66.9291" "70" "6.5" "70.85" ## [633] "170" "175" "64" "69" ## [637] "67" "72" "6" "22" ## [641] "67" "72" "72" "72.83" ## [645] "70" "63" "71" "72" ## [649] "72" "72.44" "70" "68" ## [653] "59" "5.11" "684" "61.32" ## [657] "6" "1" "68" "73" ## [661] "71" "69" "71" "66.93" ## [665] "1" "58" "55" "70" ## [669] "6*12" "66" "65" "72" ## [673] "74" "5 11" "69" "87" ## [677] "162" "71" "72" "68" ## [681] "69" "70" "73" "165" ## [685] "184" "73.22" "64" "70" ## [689] "75" "63" "72.05" "6" ## [693] "68" "75" "68" "173" ## [697] "70" "66.1417" "74" "1.6" ## [701] "68" "68" "70" "71" ## [705] "172" "170" "70" "70.08" ## [709] "67.2" "69" "70" "69" ## [713] "72.44" "69" "72.45" "69" ## [717] "67" "75.98" "71" "67" ## [721] "71" "5.7" "5.5" "62" ## [725] "71" "74" "69" "72" ## [729] "174" "170" "68" "66.9291" ## [733] "63" "160" "120" "120" ## [737] "23" "70" "192" "68" ## [741] "69" "5 11" "71" "69.3" ## [745] "167" "70.866" "67" "67.72" ## [749] "70" "67" "65" "67" ## [753] "62.4" "71" "150" "72" ## [757] "68" "75" "72" "66" ## [761] "71" "76" "69" "69" ## [765] "1.7" "66" "174" "69.29" ## [769] "72" "5.8" "64" "66" ## [773] "70" "67" "75.6" "71.5" ## [777] "6" "63" "62" "69" ## [781] "68" "71.65" "75" "68.8976" ## [785] "74" "65" "5 4" "5 8" ## [789] "5 5" "64" "67" "69" ## [793] "5.8" "67" "68.8976" "62.6" ## [797] "70" "69" "65" "63" ## [801] "67.3" "64.2" "63" "5.1" ## [805] "67" "5.11" "61" "61" ## [809] "68.11" "66.14" "67.5" "60" ## [813] "5.7" "67.72" "66" "5 7" ## [817] "70" "5 6" "64.5" "68" ## [821] "69" "68" "5 11" "76" ## [825] "70.8661" "66" "67" "71" ## [829] "62.5" "5 7" "5 7" "172" ## [833] "60" "67" "5 8" "75" ## [837] "65" "65" "64" "62" ## [841] "180" "75" "5 11" "67" ## [845] "72" "5" "68" "64" ## [849] "180" "180" "6 1" "75.4" ## [853] "68" "70" "5.9" "63" ## [857] "5.2" "73" "72" "68" ## [861] "61" "60" "5.5" "63" ## [865] "67" "69" "74" "68" ## [869] "70" "73" "70" "60" ## [873] "66" "68" "5 7" "65" ## [877] "71" "5 10" "65" "72.8346" ## [881] "5.51" "5 10" "70" "65" ## [885] "5 10" "70" "67.7" "62" ## [889] "5 9" "5 9" "5 2" "74" ## [893] "5 11" "70" "64" "73" ## [897] "70" "66" "66" "5.8" ## [901] "62" "66" "64" "67" ## [905] "71" "78" "69" "5.7" ## [909] "70" "65" "70" "167" ## [913] "71" "68" "67" "168" ## [917] "64" "68" "60" "64" ## [921] "73.2" "6" "6.1" "71" ## [925] "72.44" "67" "71" "5 11" ## [929] "68.89" "68.89" "5.69" "178" ## [933] "182" "69" "164" "66" ## [937] "68" "70" "70" "5 8" ## [941] "70.1" "185" "6" "86" ## [945] "72" "66" "66" "69" ## [949] "66" "78.74" "5.7" "70" ## [953] "67" "66" "70" "708 661" ## [957] "5.25" "72.8346" "67.7" "72" ## [961] "69" "5.5" "5 6" "5 10" ## [965] "172" "67" "6" "5 8" ## [969] "69" "64.9" "160" "6 3" ## [973] "69" "64.96" "649 606" "10000" ## [977] "64.5" "64" "65" "70" ## [981] "5.1" "67" "152" "1" ## [985] "67.5" "180" "70.86" "728 346" ## [989] "175" "70" "158" "173" ## [993] "164" "6 04" "169" "0" ## [997] "70" "185" "168" "5 9" ## [1001] "169" "5 5" "68" "174" ## [1005] "68" "68.11024" "6.3" "69" ## [1009] "179" "66" "69" "5 7" ## [1013] "68.89" "73.22" "66" "5.5" ## [1017] "6" "75" "6" "170" ## [1021] "52" "52" "79" "70" ## [1025] "6" "172" "64" "68.11" ## [1029] "67" "158" "100" "75" ## [1033] "75" "81" "69" "68.8976378" ## [1037] "159" "76" "190" "69.6" ## [1041] "5.7" "63" "70" "72.44" ## [1045] "66" "170" "158" "73" ## [1049] "72" "60" "6 4" "65" ## [1053] "66" "180" "68.9" "5.57" ## [1057] "71" "5 4" "210" "72" ## [1061] "68" "67" "88" "69" ## [1065] "64.2" "6" "63" "162" ## [1069] "66" "170" "74" "72" ## [1073] "5.7" "71" "50" "170" ## [1077] "157" "186" "65" "170" ## [1081] "63" "7 283 465" "74" "67.7165" ## [1085] "5" "5" "34" "69" ## [1089] "69" "50" "69" "69" ## [1093] "161" "5 6" "5 6" ``` --- # 4. Calculate Total Inches Now .hi-blue[separate] the cleaned height into feet and inch variables. To do this, we'll need one of the `separate_` functions. There are two types: -- 1. .hi-medgrn[separate_wider_X]: separate one variable into .hi-medgrn[multiple columns] * `separate_wider_delim()`: split on a .hi-purple[delimiter] * `separate_wider_position()`: split on .hi-red[position] * `separate_wider_regex()`: split on a .hi-pink[regular expression] -- 2. .hi-blue[separate_longer_X]: separate one variable into .hi-blue[multiple rows] * `separate_longer_delim()`: split on a .hi-purple[delimiter] * `separate_longer_position()`: split on .hi-red[position] --- # Separate Wider .hi-medgrn[1\. separate_wider_X]: separate one variable into .hi-medgrn[multiple columns] * `separate_wider_delim()`: split on a .hi-purple[delimiter] * `separate_wider_position()`: split on .hi-red[position] * `separate_wider_regex()`: split on a .hi-pink[regular expression] ``` r df_wide <- select(col, state, county) %>% slice(83:86) df_wide ``` ``` ## # A tibble: 4 × 2 ## state county ## <chr> <chr> ## 1 California Los Angeles ## 2 California Ventura ## 3 California San Luis Obispo ## 4 California Los Angeles ``` --- # Separate Wider .hi-medgrn[1\. separate_wider_X]: separate one variable into .hi-medgrn[multiple columns] * `separate_wider_delim()`: split on a .hi-purple[delimiter] * `separate_wider_position()`: split on .hi-red[position] * `separate_wider_regex()`: split on a .hi-pink[regular expression] .font90[ ``` r df_wide %>% separate_wider_delim(county, # variable to split delim = " ", # delimiter to use names = c("cty_1", "cty_2"), # new var names too_few = "align_end", # grab the end if too few pieces too_many = "debug") # add debug col if too many pieces ``` ``` ## # A tibble: 4 × 7 ## state cty_1 cty_2 county county_ok county_pieces county_remainder ## <chr> <chr> <chr> <chr> <lgl> <int> <chr> ## 1 California Los Angeles Los Angeles TRUE 2 "" ## 2 California <NA> Ventura Ventura TRUE 1 "" ## 3 California San Luis San Luis Ob… FALSE 3 " Obispo" ## 4 California Los Angeles Los Angeles TRUE 2 "" ``` ] --- # Separate Longer .hi-blue[2\. separate_longer_X]: separate one variable into .hi-blue[multiple rows] * `separate_longer_delim()`: split on a .hi-purple[delimiter] * `separate_longer_position()`: split on .hi-red[position] ``` ## # A tibble: 10 × 2 ## state college ## <chr> <chr> ## 1 Oregon Reed College ## 2 Minnesota St. Paul College ## 3 Michigan Jackson College ## 4 Michigan Central Michigan University ## 5 Louisiana Southern University Law Center ## 6 Pennsylvania Moravian College ## 7 Texas Austin College ## 8 Florida Keiser University at Jacksonville ## 9 Minnesota Bethany Lutheran College ## 10 West Virginia West Virginia University ``` --- # Separate Longer .hi-blue[2\. separate_longer_X]: separate one variable into .hi-blue[multiple rows] * `separate_longer_delim()`: split on a .hi-purple[delimiter] * `separate_longer_position()`: split on .hi-red[position] .font90[ ``` r df_long %>% separate_longer_delim(college, delim = " ") ``` ``` ## # A tibble: 28 × 2 ## state college ## <chr> <chr> ## 1 Oregon Reed ## 2 Oregon College ## 3 Minnesota St. ## 4 Minnesota Paul ## 5 Minnesota College ## 6 Michigan Jackson ## 7 Michigan College ## 8 Michigan Central ## 9 Michigan Michigan ## 10 Michigan University ## # ℹ 18 more rows ``` ] --- # 4. Calculate Total Inches ``` r heights2 <- reported_heights %>% mutate(height_clean = str_replace_all(height, "'|,|\"|,|ft|feet|inches|and|cm", " "), height_clean = str_squish(height_clean), height_clean = str_replace(height_clean, " \\.", " ")) %>% separate_wider_delim(height_clean, # variable to split delim = " ", # delimiter to split on names = c("feet", "inches"), # new var names too_few = "align_end", # add just inches if too few too_many = "debug") %>% # add col to diagnose too many arrange(height_clean_ok) ``` --- # 4. Calculate Total Inches What new variables do we have? * `feet/inches`: the split variables we requested * `height_clean_ok`: boolean of whether we got the "right" number of pieces (1-2) * `height_clean_pieces`: numeric number of split pieces * `height_clean_remainder`: the extra string pieces when `\(>2\)` --- # 5. Deal with Extra Pieces Look at the top of the data. We can see there are a few values with extra pieces that are erroneous entries: ``` r head(heights2, 4) ``` ``` ## # A tibble: 4 × 9 ## time_stamp sex height feet inches height_clean height_clean_ok ## <chr> <chr> <chr> <chr> <chr> <chr> <lgl> ## 1 2014-10-08 19:19:33 Female Five foo… Five foot Five foot e… FALSE ## 2 2017-08-09 12:16:38 Male 7,283,465 7 283 7 283 465 FALSE ## 3 2014-09-02 13:40:36 Male 75 <NA> 75 75 TRUE ## 4 2014-09-02 13:46:59 Male 70 <NA> 70 70 TRUE ## # ℹ 2 more variables: height_clean_pieces <int>, height_clean_remainder <chr> ``` --- # 5. Deal with Extra Pieces Let's remove those (I'm starting a new dataframe to iterate further) * Use an anonymous lambda function (add `~` before function and `.x` for argument) ``` r heights3 <- heights2 %>% # replace NA mutate(across(c(feet, inches), ~ifelse(is.na(.x), 0, .x))) %>% # apply ifelse across both feet/inches variables mutate(across(c(feet, inches), ~ifelse(height_clean_pieces ==3, NA, as.numeric(.x)))) %>% # drop variables we no longer need select(-height_clean_pieces, -height_clean_remainder, -height_clean_ok) head(heights3, 2) ``` ``` ## # A tibble: 2 × 6 ## time_stamp sex height feet inches height_clean ## <chr> <chr> <chr> <dbl> <dbl> <chr> ## 1 2014-10-08 19:19:33 Female Five foot eight inches NA NA Five foot eight ## 2 2017-08-09 12:16:38 Male 7,283,465 NA NA 7 283 465 ``` --- # 6. Make Combined Inch Measurement Now add a "clean" combined inch measurement: ``` r heights3 <- heights3 %>% mutate(inches_clean = feet * 12 + inches)%>% arrange(inches_clean) ``` If you `View` the data, you'll find: * Many values between 5 and 7 which are clearly in .hi-medgrn[feet] instead of inches. * Many values between 150 and 214 which are clearly in .hi-blue[cm] instead of inches. --- # 7. Fix Units This is a good use case for `case_when()` and `between()`: ``` r heights3 <- heights3 %>% mutate(inches_clean = case_when( # First convert values in feet inches_clean >= 5 & inches_clean <= 7 ~ inches_clean*12, # Next convert cm values if between 150 and 214 between(inches_clean, 150, 214) ~ inches_clean / 2.54, # Otherwise, keep same value TRUE ~ inches_clean) ) ``` --- # 8. Check Plausible Range How many values are still outside a plausible range? ``` r heights3 %>% mutate(ok = between(inches_clean, 3.5*12, 7.5*12)) %>% count(ok) ``` ``` ## # A tibble: 3 × 2 ## ok n ## <lgl> <int> ## 1 FALSE 29 ## 2 TRUE 1061 ## 3 NA 5 ``` --- # 9. Deal with Implausible Values What should we do with our implausible values? 1. Some of these may still contain interpretable information. .hi-slate[There may be more cleaning to do.] 1. Some of them may not, in which case we probably won't use them for analysis. * Don't discard them yet! We'll come back to extreme values (aka outliers) in a couple of weeks. 1. You'll find there are also a few instances where our cleaned value appears sensible, but the original value does not. * You may need to tweak the algorithm further. --- # LOOK AT THE DISTRIBUTIONS! .hi-medgrn[Pro Tip: always look at distributions of numeric variables!] ``` r ggplot(heights3) + geom_histogram(aes(x = inches_clean), binwidth = 6) ``` <img src="data:image/png;base64,#04-Cleaning_files/figure-html/unnamed-chunk-54-1.png" width="68%" style="display: block; margin: auto;" /> --- # Aside: Regular Expressions .hi-medgrn[Regular expressions] are code to .hi-medgrn[describe patterns in strings] that are common acros basically all programming languages ``` r names <- c("Python", "SPSS", "Stata", "Julia") # Match strings that CONTAIN a lowercase "t" str_view_all(names, "t") ``` ``` ## [1] │ Py<t>hon ## [2] │ SPSS ## [3] │ S<t>a<t>a ## [4] │ Julia ``` --- # Common Regular Expressions Common regular expression operators include .pull-left[ Match strings that .hi-blue[start] with a capital "S": ``` r str_view_all(names, "^S") ``` ``` ## [1] │ Python ## [2] │ <S>PSS ## [3] │ <S>tata ## [4] │ Julia ``` ] .pull-right[ Match strings that .hi-medgrn[end] with a lowercase "a": ``` r str_view_all(names, "a$") ``` ``` ## [1] │ Python ## [2] │ SPSS ## [3] │ Stat<a> ## [4] │ Juli<a> ``` ] `^` and `$` are called .hi-slate[anchors]. --- # Common Regular Expressions .pull-left[ Match all lowercase vowels: ``` r str_view_all(names, "[aeiou]") ``` ``` ## [1] │ Pyth<o>n ## [2] │ SPSS ## [3] │ St<a>t<a> ## [4] │ J<u>l<i><a> ``` ] .pull-right[ Match everything BUT lowercase vowels: ``` r str_view_all(names, "[^aeiou]") ``` ``` ## [1] │ <P><y><t><h>o<n> ## [2] │ <S><P><S><S> ## [3] │ <S><t>a<t>a ## [4] │ <J>u<l>ia ``` ] --- # Common Regular Expressions .pull-left[ Use a vertical bar (`|`) for "or": ``` r str_view_all(names, "Stata|SPSS") ``` ``` ## [1] │ Python ## [2] │ <SPSS> ## [3] │ <Stata> ## [4] │ Julia ``` ] .pull-right[ And parentheses to clarify: ``` r str_view_all(names, "S(tata|PSS)") ``` ``` ## [1] │ Python ## [2] │ <SPSS> ## [3] │ <Stata> ## [4] │ Julia ``` ] --- # Last Remarks on Regular Expressions All kinds of regex cheat sheets and interactive testers are available via a quick Google. Regexps are hard to read and troubleshoot. Try not to get too deep into them -- you can often accomplish the same goal by breaking it up into smaller chunks. > Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems. - Jamie Zawinski --- # Last Remarks on Regular Expressions This is (the start of) a real regular expression that checks whether an email address is valid: `(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)` --- # Useful Functions for Cleaning Data .hi-slate[stringr] functions we've used here: * `str_replace` and `str_replace_all`: Replace parts of strings. * `str_trim` and `str_squish`: Remove extra spaces. * `str_view_all`: Illustrates matches, to help develop regular expressions. Other .hi-medgrn[tidyverse] functions we've used: * `between`: Test whether values fall within a numerical range. * `case_when`: Multiple conditional expressions. --- # Useful Functions for Cleaning Data Other useful .hi-slate[stringr] functions: * `str_sub`: Subset strings by position of characters. * `str_detect`: Test whether a string matches a pattern. * `str_extract` and `str_extract_all`: extract matching portion(s) of a string Other useful .hi-medgrn[tidyverse] functions: * `na_if`: Set a certain value to missing. * `bind_rows`: Append two datasets that have the same variable structure. * `replace_na`: Set missing values to a certain value. --- class: inverse, middle name: numbers # Number Storage --- # Floating Point Problems Simplify this expression: `\(1-\frac{1}{49}*49\)` It's obviously 0. Now ask R: ``` r 1 - (1/49)*49 ``` -- ``` ## [1] 1.110223e-16 ``` This is called a .hi-medgrn[floating point] problem. It arises from the way computers store numbers. --- # Floating Point Problems R doesn't notice that `\(49/49\)` simplifies to 1. It just follows the order of operations. So the first thing it does is calculate: ``` r (1/49) ``` ``` ## [1] 0.02040816 ``` Which is an irrational number. So R rounds it to 53 significant digits before multiplying by 49. --- # Floating Point Problems Most of the time, 53 digits is plenty of precision. But sometimes it creates problems. Note: This explanation is actually too simple. The floating-point issue goes .hi-medgrn[deeper than just irrational numbers]. Here's another example: ``` r 1 - 0.9 - 0.1 ``` ``` ## [1] -2.775558e-17 ``` -- <br> In 1996, a floating-point error caused a European Space Agency rocket to [.hi-dkorange[self-destruct 37 seconds after it was launched.]](https://jam.dev/blog/famous-bugs-rocket-launch/) --- # Avoiding Floating Point Errors Pay attention to the data type of your variables. Avoid using logical conditions like `height==180` for numeric variables. * `height` may even read as `180` in the `View` window * But under the hood, it might still be stored as `180.000000000000173...`. What you can do instead: * .hi-medgrn[Best option:] `dplyr::near` compares numbers with a built-in tolerance. * Use `>` and `<` comparisons, or `between(height, 179.9, 180.1)`. * Convert in place: `as.integer(height) == 180` * Or with finer control: `round(height, digits=6) == 180` * If all values are integers, store the variable as an integer in the first place. --- # How to Store a Number? .hi-blue[Numeric] variables are stored in scientific notation. * Use to represent a single value, for which digits decrease in importance from left to right. * Example: My height is `172.962469405113283` cm. -- .hi-medgrn[Integer] variables lack decimal places. * Saves memory relative to numeric variables. * Stores values exactly, avoiding some floating-point problems. -- .hi-pink[Character] variables store the full sequence of digits literally. * Use when digits lack quantitative information, and each digit is equally important. * Phone numbers, credit card numbers, etc. * No chance of the right-most digits getting lost or corrupted. --- # More Variable Formats .hi-red[Dates and times] allow you to easily do math and logic on dates and times. * See tidyverse package `lubridate`. -- .hi-purple[Factors] allow you to store values as numbers, but *display* them as strings. * This is useful for sorting things like month names: "Jan", "Feb", "Mar", "Apr".... * See tidyverse packages `forcats`. --- # Memory Space Memory space quickly becomes a problem when you work with large datasets. * But R does a reasonably good job of handling storage efficiently. Logical variables are smaller than integers, which are smaller than numeric. Does it save memory to store a variable as a factor instead of a string? * This .hi-medgrn[used to be true:] factor variables only store the factor labels once. * But .hi-blue[no longer:] R uses a global string pool - each unique string is only stored once. `pryr::object_size()` will tell you how much memory an object takes up (accounting for shared elements within an object). --- class: inverse, middle name: check # Data Cleaning Checklist --- # Data Cleaning Checklist **Part A.** Get to know your data frame. *** 1. **Convert file formats**, as necessary. 1. **Import data and wrangle into a tidy layout.** 1. **Remove irrelevant, garbage, or empty** columns and rows. 1. **Identify the primary key**, or define a surrogate key. 1. **Resolve duplicates** (remove true duplicates, or redefine the primary key). 1. **Understand the definition, origin, and units** of each variable, and document as necessary. 1. **Rename variables** as necessary, to be succinct and descriptive. --- # Data Cleaning Checklist **Part B.** Check your variables. *** .hi-blue[1\. Understand patterns of missing values.] - Find out why they're missing. - Make sure they are not more widespread than you expect. - Convert other intended designations (i.e., -1 or -999) to NA. - Distinguish between missing values and true zeros. .hi-medgrn[2\. Convert to numeric] when variables are inappropriately stored as strings. Correct typos as necessary. .hi-purple[3\.Convert to date/time] format where appropriate. --- # Data Cleaning Checklist **Part B.** Check your variables. *** .hi-blue[1\. Recode binary variables] as 0/1 as necessary. (Often stored as "Yes"/"No" or 1/2.) .hi-medgrn[2\. Convert to factors] when strings take a limited set of possible values. --- # Data Cleaning Checklist **Part C.** Check the values of your quantitative variables. *** .hi-medgrn[1\. Make units and scales consistent.] Avoid having in the same variable: - Some values in meters and others in feet. - Some values in USD and others in GBP. - Some percentages as 40% and others as 0.4. - Some values as millions and others as billions. .hi-purple[2\. Perform logical checks on quantitative variables:] - Define any range restrictions each variable should satisfy, and check them (graphically too!). - Correct any violations that are indisputable data entry mistakes. - Create a flag variable to mark remaining violations. --- # Data Cleaning Checklist **Part D.** Check the rest of your values. *** .hi-purple[1\. Clean string variables.] Some common operations: - Make entirely uppercase or lowercase - Remove punctuation - Trim spaces (extra, starting, ending) - Ensure order of names is consistent - Remove uninformative words like "the" and "a" - Correct spelling inconsistencies (consider text clustering packages) .hi-medgrn[2\. Literally look at your data] tables every step of the way, to spot issues you haven't thought of, and to make sure you're actually doing what you think you're doing. --- # Data Cleaning Checklist **Part E.** Finish up the cleaning phase. *** .hi-purple[1\. Save your clean data] to disk before further manipulation (merging dataframes, transforming variables, restricting the sample). Think of the whole wrangling/cleaning/analysis pipeline as 2 big phases: - Taking messy data from external sources and making a nice, neat table that you are likely to use for multiple purposes in analysis. - Taking that nice, neat table and doing all kinds of new things with it. .hi-medgrn[2\. Record all steps in a script.] .hi-blue[3\. Never overwrite the original raw data file.] --- # Data Cleaning Tips Whenever possible, make changes to values .hi-medgrn[only by logical conditions] on one or more substantive variables - .hi-slate[*not*] by observation ID or (even worse) row number. You want the changes you make to be rule-based, for 2 reasons: * So that they're .hi-blue[general] -- able to handle upstream changes to the data. * So that they're .hi-pink[principled] -- no one can accuse you of cherry-picking. --- # Table of Contents 1. [Prologue](#prologue) 1. [Paths and Importing Data](#import) 1. [Keys and Relational Data](#keys) 1. [String Cleaning](#string) 1. [Number Storage](#numbers) 1. [Data Cleaning Checklist](#check)